git.vger.kernel.org archive mirror
* [PATCH 00/24] pack-objects: multi-pack verbatim reuse
@ 2023-11-28 19:07 Taylor Blau
  2023-11-28 19:07 ` [PATCH 01/24] pack-objects: free packing_data in more places Taylor Blau
                   ` (26 more replies)
  0 siblings, 27 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:07 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Back in fff42755ef (pack-bitmap: add support for bitmap indexes,
2013-12-21), we added support for reachability bitmaps, and taught
pack-objects how to reuse verbatim chunks from the bitmapped pack. When
multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to
use the MIDX's "preferred" pack as the source for verbatim reuse.

This allows repositories to incrementally repack themselves (e.g., using
a `--geometric` repack), storing the result in a MIDX, and generating a
corresponding bitmap. This keeps our bitmap coverage up-to-date, while
maintaining a relatively small number of packs.

However, it is recommended (and matches what we do in production at
GitHub) that repositories repack themselves all-into-one, and
generate a corresponding single-pack reachability bitmap. This is done
for a couple of reasons, but the most relevant one to this series is
that it enables us to perform verbatim pack-reuse over a complete copy
of the repository, since the entire repository resides in a single pack
(and thus is eligible for verbatim pack-reuse).

As repositories grow larger, packing their contents into a single pack
becomes less feasible. This series extends the pack-reuse mechanism to
operate over multiple packs which are known ahead of time to be disjoint
with respect to one another's set of objects.

The implementation has a few components:

  - A new MIDX chunk called "Disjoint packfiles" or DISP is introduced
    to keep track of the bitmap position, number of objects, and
    disjointness for each pack contained in the MIDX.

  - A new mode for `git multi-pack-index write --stdin-packs` that
    allows specifying disjoint packs, as well as a new option
    `--retain-disjoint` which preserves the set of existing disjoint
    packs in the new MIDX.

  - A new pack-objects mode `--ignore-disjoint`, which produces packs
    which are disjoint with respect to the current set of disjoint packs
    (i.e. it discards any objects from the packing list which appear in
    any of the known-disjoint packs).

  - A new repack mode, `--extend-disjoint` which causes any new pack(s)
    which are generated to be disjoint with respect to the set of packs
    currently marked as disjoint, minus any pack(s) which are about to
    be deleted.

With all of that in place, the patch series then rewrites all of the
pack-reuse functions in terms of the new `bitmapped_pack` structure.
Once we have dropped all of the assumptions stemming from only
performing pack-reuse over a single candidate pack, we can then enable
reuse over all of the disjoint packs.
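
To illustrate the filtering step behind `--ignore-disjoint` described
above, here is a minimal sketch, not the code from this series. It
assumes objects are represented by plain 32-bit integers instead of
object IDs, and that the IDs already stored in the known-disjoint packs
are available as a sorted array. The names `cmp_u32()` and
`drop_known_objects()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison of two uint32_t values, for bsearch(). */
static int cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: compact `list` in place so that it no longer
 * contains any ID found in `known` (which must be sorted ascending).
 * Returns the number of entries kept at the front of `list`.
 */
static size_t drop_known_objects(uint32_t *list, size_t nr,
				 const uint32_t *known, size_t known_nr)
{
	size_t i, kept = 0;

	assert(list != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		if (known_nr &&
		    bsearch(&list[i], known, known_nr, sizeof(*known), cmp_u32))
			continue; /* already stored in a disjoint pack */
		list[kept++] = list[i];
	}
	return kept;
}
```

Any pack written from the surviving entries shares no object with the
known-disjoint packs, which is the property that multi-pack reuse relies
on.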

In addition to the many new tests in t5332 added by this series, I tried
to simulate a "real world" test on git.git by breaking the repository
into chunks of 1,000 commits (plus their set of reachable objects not
reachable from earlier chunk(s)) and packing those chunks. This produces
a large number of packs with the objects from git.git which are known to
be disjoint with respect to one another.

    $ git clone git@github.com:git/git.git base

    $ cd base
    $ mv .git/objects/pack/pack-*.idx{,.bak}
    $ git unpack-objects <.git/objects/pack/pack-*.pack

    # pack the objects from each successive block of 1k commits
    $ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac)
      do
        echo $rev |
        git.compile pack-objects --revs --unpacked .git/objects/pack/pack || return 1
      done
    # and grab any stragglers, pruning the unpacked objects
    $ git repack -d

I then constructed a MIDX and corresponding bitmap:

    $ find_pack () {
        for idx in .git/objects/pack/pack-*.idx
        do
          git show-index <$idx | grep -q "$1" && basename $idx
        done
      }
    $ preferred="$(find_pack $(git rev-parse HEAD))"

    $ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' |
        git.compile multi-pack-index write --bitmap --stdin-packs \
          --preferred-pack=$preferred
    $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in

With all of that in place, I was able to produce a significant speed-up
by reusing objects from multiple packs:

    $ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
    Benchmark 1: single-pack reuse
      Time (mean ± σ):      6.094 s ±  0.023 s    [User: 43.723 s, System: 0.358 s]
      Range (min … max):    6.063 s …  6.126 s    10 runs

    Benchmark 2: multi-pack reuse
      Time (mean ± σ):     906.5 ms ±   3.2 ms    [User: 1081.5 ms, System: 30.9 ms]
      Range (min … max):   903.5 ms … 912.7 ms    10 runs

    Summary
      multi-pack reuse ran
        6.72 ± 0.03 times faster than single-pack reuse

(There are corresponding tests in p5332 that test different sized chunks
and measure the runtime performance as well as resulting pack size).

Performing verbatim pack reuse naturally trades off between CPU time and
the resulting pack size. In the above example, the single-pack reuse
case produces a clone size of ~194 MB on my machine, while the
multi-pack reuse case produces a clone size closer to ~266 MB, which is
a ~37% increase in clone size.

I think there is still some opportunity to close this gap, since the
"packing" strategy here is extremely naive. In a production setting, I'm
sure that there are better thought-out repacking strategies that
would produce more similar clone sizes.

I considered breaking this series up into smaller chunks, but was
unsatisfied with the result. Since this series is rather large, if you
have alternate suggestions on better ways to structure this, please let
me know.

Thanks in advance for your review!

Taylor Blau (24):
  pack-objects: free packing_data in more places
  pack-bitmap-write: deep-clear the `bb_commit` slab
  pack-bitmap: plug leak in find_objects()
  midx: factor out `fill_pack_info()`
  midx: implement `DISP` chunk
  midx: implement `midx_locate_pack()`
  midx: implement `--retain-disjoint` mode
  pack-objects: implement `--ignore-disjoint` mode
  repack: implement `--extend-disjoint` mode
  pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
  pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
  pack-bitmap: return multiple packs via
    `reuse_partial_packfile_from_bitmap()`
  pack-objects: parameterize pack-reuse routines over a single pack
  pack-objects: keep track of `pack_start` for each reuse pack
  pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
  pack-objects: prepare `write_reused_pack()` for multi-pack reuse
  pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
    reuse
  pack-objects: include number of packs reused in output
  pack-bitmap: prepare to mark objects from multiple packs for reuse
  pack-objects: add tracing for various packfile metrics
  t/test-lib-functions.sh: implement `test_trace2_data` helper
  pack-objects: allow setting `pack.allowPackReuse` to "single"
  pack-bitmap: reuse objects from all disjoint packs
  t/perf: add performance tests for multi-pack reuse

 Documentation/config/pack.txt          |   8 +-
 Documentation/git-multi-pack-index.txt |  12 ++
 Documentation/git-pack-objects.txt     |   8 +
 Documentation/git-repack.txt           |  12 ++
 Documentation/gitformat-pack.txt       | 109 ++++++++++
 builtin/multi-pack-index.c             |  13 +-
 builtin/pack-objects.c                 | 200 +++++++++++++++----
 builtin/repack.c                       |  57 +++++-
 midx.c                                 | 218 +++++++++++++++++---
 midx.h                                 |  11 +-
 pack-bitmap-write.c                    |   9 +-
 pack-bitmap.c                          | 265 ++++++++++++++++++++-----
 pack-bitmap.h                          |  18 +-
 pack-objects.c                         |  15 ++
 pack-objects.h                         |   1 +
 t/helper/test-read-midx.c              |  31 ++-
 t/lib-disjoint.sh                      |  49 +++++
 t/perf/p5332-multi-pack-reuse.sh       |  81 ++++++++
 t/t5319-multi-pack-index.sh            | 140 +++++++++++++
 t/t5331-pack-objects-stdin.sh          | 156 +++++++++++++++
 t/t5332-multi-pack-reuse.sh            | 219 ++++++++++++++++++++
 t/t6113-rev-list-bitmap-filters.sh     |   2 +
 t/t7700-repack.sh                      |   4 +-
 t/t7705-repack-extend-disjoint.sh      | 142 +++++++++++++
 t/test-lib-functions.sh                |  14 ++
 25 files changed, 1650 insertions(+), 144 deletions(-)
 create mode 100644 t/lib-disjoint.sh
 create mode 100755 t/perf/p5332-multi-pack-reuse.sh
 create mode 100755 t/t5332-multi-pack-reuse.sh
 create mode 100755 t/t7705-repack-extend-disjoint.sh


base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d
-- 
2.43.0.24.g980b318f98


* [PATCH 01/24] pack-objects: free packing_data in more places
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
@ 2023-11-28 19:07 ` Taylor Blau
  2023-11-30 10:18   ` Patrick Steinhardt
  2023-11-28 19:07 ` [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:07 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The pack-objects internals use a packing_data struct to track what
objects are part of the pack(s) being formed.

Since these structures contain allocated fields, failing to
appropriately free() them results in a leak. Plug that leak by
introducing a free_packing_data() function, and call it in the
appropriate spots.

This is a fairly straightforward leak to plug, since none of the callers
expect to read any values or have any references to parts of the address
space being freed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  1 +
 midx.c                 |  5 +++++
 pack-objects.c         | 15 +++++++++++++++
 pack-objects.h         |  1 +
 4 files changed, 22 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 89a8b5a976..bfa60359d4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4522,6 +4522,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			   reuse_packfile_objects);
 
 cleanup:
+	free_packing_data(&to_pack);
 	list_objects_filter_release(&filter_options);
 	strvec_clear(&rp);
 
diff --git a/midx.c b/midx.c
index 2f3863c936..3b727dc633 100644
--- a/midx.c
+++ b/midx.c
@@ -1592,8 +1592,13 @@ static int write_midx_internal(const char *object_dir,
 				      flags) < 0) {
 			error(_("could not write multi-pack bitmap"));
 			result = 1;
+			free_packing_data(&pdata);
+			free(commits);
 			goto cleanup;
 		}
+
+		free_packing_data(&pdata);
+		free(commits);
 	}
 	/*
 	 * NOTE: Do not use ctx.entries beyond this point, since it might
diff --git a/pack-objects.c b/pack-objects.c
index f403ca6986..1c7bedcc94 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -151,6 +151,21 @@ void prepare_packing_data(struct repository *r, struct packing_data *pdata)
 	init_recursive_mutex(&pdata->odb_lock);
 }
 
+void free_packing_data(struct packing_data *pdata)
+{
+	if (!pdata)
+		return;
+
+	free(pdata->cruft_mtime);
+	free(pdata->in_pack);
+	free(pdata->in_pack_by_idx);
+	free(pdata->in_pack_pos);
+	free(pdata->index);
+	free(pdata->layer);
+	free(pdata->objects);
+	free(pdata->tree_depth);
+}
+
 struct object_entry *packlist_alloc(struct packing_data *pdata,
 				    const struct object_id *oid)
 {
diff --git a/pack-objects.h b/pack-objects.h
index 0d78db40cb..336217e8cd 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -169,6 +169,7 @@ struct packing_data {
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
+void free_packing_data(struct packing_data *pdata);
 
 /* Protect access to object database */
 static inline void packing_data_lock(struct packing_data *pdata)
-- 
2.43.0.24.g980b318f98



* [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
  2023-11-28 19:07 ` [PATCH 01/24] pack-objects: free packing_data in more places Taylor Blau
@ 2023-11-28 19:07 ` Taylor Blau
  2023-11-30 10:18   ` Patrick Steinhardt
  2023-12-12  7:04   ` Jeff King
  2023-11-28 19:08 ` [PATCH 03/24] pack-bitmap: plug leak in find_objects() Taylor Blau
                   ` (24 subsequent siblings)
  26 siblings, 2 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:07 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The `bb_commit` commit slab is used by the pack-bitmap-write machinery
to track various pieces of bookkeeping used to generate reachability
bitmaps.

Even though we clear the slab when freeing the bitmap_builder struct
(with `bitmap_builder_clear()`), there are still pointers which point to
locations in memory that have not yet been freed, resulting in a leak.

Plug the leak by introducing a suitable `free_fn` for the `struct
bb_commit` type, and make sure it is called on each member of the slab
via the `deep_clear_bb_data()` function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap-write.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index f4ecdf8b0e..dd3a415b9d 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -198,6 +198,13 @@ struct bb_commit {
 	unsigned idx; /* within selected array */
 };
 
+static void clear_bb_commit(struct bb_commit *commit)
+{
+	free(commit->reverse_edges);
+	bitmap_free(commit->commit_mask);
+	bitmap_free(commit->bitmap);
+}
+
 define_commit_slab(bb_data, struct bb_commit);
 
 struct bitmap_builder {
@@ -339,7 +346,7 @@ static void bitmap_builder_init(struct bitmap_builder *bb,
 
 static void bitmap_builder_clear(struct bitmap_builder *bb)
 {
-	clear_bb_data(&bb->data);
+	deep_clear_bb_data(&bb->data, clear_bb_commit);
 	free(bb->commits);
 	bb->commits_nr = bb->commits_alloc = 0;
 }
-- 
2.43.0.24.g980b318f98



* [PATCH 03/24] pack-bitmap: plug leak in find_objects()
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
  2023-11-28 19:07 ` [PATCH 01/24] pack-objects: free packing_data in more places Taylor Blau
  2023-11-28 19:07 ` [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-12-12  7:04   ` Jeff King
  2023-11-28 19:08 ` [PATCH 04/24] midx: factor out `fill_pack_info()` Taylor Blau
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The `find_objects()` function creates an object_list for any tips of the
reachability query which do not have corresponding bitmaps.

The object_list is not used outside of `find_objects()`, but we never
free it with `object_list_free()`, resulting in a leak. Let's plug that
leak by calling `object_list_free()`, which results in t6113 becoming
leak-free.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c                      | 2 ++
 t/t6113-rev-list-bitmap-filters.sh | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 0260890341..d2f1306960 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1280,6 +1280,8 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		base = fill_in_bitmap(bitmap_git, revs, base, seen);
 	}
 
+	object_list_free(&not_mapped);
+
 	return base;
 }
 
diff --git a/t/t6113-rev-list-bitmap-filters.sh b/t/t6113-rev-list-bitmap-filters.sh
index 86c70521f1..459f0d7412 100755
--- a/t/t6113-rev-list-bitmap-filters.sh
+++ b/t/t6113-rev-list-bitmap-filters.sh
@@ -4,6 +4,8 @@ test_description='rev-list combining bitmaps and filters'
 . ./test-lib.sh
 . "$TEST_DIRECTORY"/lib-bitmap.sh
 
+TEST_PASSES_SANITIZE_LEAK=true
+
 test_expect_success 'set up bitmapped repo' '
 	# one commit will have bitmaps, the other will not
 	test_commit one &&
-- 
2.43.0.24.g980b318f98



* [PATCH 04/24] midx: factor out `fill_pack_info()`
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (2 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 03/24] pack-bitmap: plug leak in find_objects() Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-30 10:18   ` Patrick Steinhardt
  2023-11-28 19:08 ` [PATCH 05/24] midx: implement `DISP` chunk Taylor Blau
                   ` (22 subsequent siblings)
  26 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

When selecting which packfiles will be written while generating a MIDX,
the MIDX internals fill out a 'struct pack_info' with various pieces of
book-keeping.

Instead of filling out each field of the `pack_info` structure
individually in each of the two spots that modify the array of such
structures (`ctx->info`), extract a common routine that does this for
us.

This reduces the code duplication by a modest amount. But more
importantly, it zero-initializes the structure before assigning values
into it. This hardens us for a future change which will add additional
fields to this structure which (until this patch) was not
zero-initialized.

As a result, any new fields added to the `pack_info` structure need only
be updated in a single location, instead of at each spot within midx.c.

There are no functional changes in this patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/midx.c b/midx.c
index 3b727dc633..591b3c636e 100644
--- a/midx.c
+++ b/midx.c
@@ -464,6 +464,17 @@ struct pack_info {
 	unsigned expired : 1;
 };
 
+static void fill_pack_info(struct pack_info *info,
+			   struct packed_git *p, char *pack_name,
+			   uint32_t orig_pack_int_id)
+{
+	memset(info, 0, sizeof(struct pack_info));
+
+	info->orig_pack_int_id = orig_pack_int_id;
+	info->pack_name = pack_name;
+	info->p = p;
+}
+
 static int pack_info_compare(const void *_a, const void *_b)
 {
 	struct pack_info *a = (struct pack_info *)_a;
@@ -504,6 +515,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			     const char *file_name, void *data)
 {
 	struct write_midx_context *ctx = data;
+	struct packed_git *p;
 
 	if (ends_with(file_name, ".idx")) {
 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
@@ -530,17 +542,14 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
-		ctx->info[ctx->nr].p = add_packed_git(full_path,
-						      full_path_len,
-						      0);
-
-		if (!ctx->info[ctx->nr].p) {
+		p = add_packed_git(full_path, full_path_len, 0);
+		if (!p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(ctx->info[ctx->nr].p)) {
+		if (open_pack_index(p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
 			close_pack(ctx->info[ctx->nr].p);
@@ -548,9 +557,8 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			return;
 		}
 
-		ctx->info[ctx->nr].pack_name = xstrdup(file_name);
-		ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
-		ctx->info[ctx->nr].expired = 0;
+		fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
+			       ctx->nr);
 		ctx->nr++;
 	}
 }
@@ -1310,11 +1318,6 @@ static int write_midx_internal(const char *object_dir,
 		for (i = 0; i < ctx.m->num_packs; i++) {
 			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
 
-			ctx.info[ctx.nr].orig_pack_int_id = i;
-			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
-			ctx.info[ctx.nr].p = ctx.m->packs[i];
-			ctx.info[ctx.nr].expired = 0;
-
 			if (flags & MIDX_WRITE_REV_INDEX) {
 				/*
 				 * If generating a reverse index, need to have
@@ -1330,10 +1333,10 @@ static int write_midx_internal(const char *object_dir,
 				if (open_pack_index(ctx.m->packs[i]))
 					die(_("could not open index for %s"),
 					    ctx.m->packs[i]->pack_name);
-				ctx.info[ctx.nr].p = ctx.m->packs[i];
 			}
 
-			ctx.nr++;
+			fill_pack_info(&ctx.info[ctx.nr++], ctx.m->packs[i],
+				       xstrdup(ctx.m->pack_names[i]), i);
 		}
 	}
 
-- 
2.43.0.24.g980b318f98



* [PATCH 05/24] midx: implement `DISP` chunk
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (3 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 04/24] midx: factor out `fill_pack_info()` Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-30 10:18   ` Patrick Steinhardt
  2023-12-03 13:15   ` Junio C Hamano
  2023-11-28 19:08 ` [PATCH 06/24] midx: implement `midx_locate_pack()` Taylor Blau
                   ` (21 subsequent siblings)
  26 siblings, 2 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

When a multi-pack bitmap is used to implement verbatim pack reuse (that
is, when verbatim chunks from an on-disk packfile are copied
directly[^1]), it does so by using its "preferred pack" as the source
for pack-reuse.

This allows repositories to pack the majority of their objects into a
single (often large) pack, and then use it as the single source for
verbatim pack reuse. This increases the amount of objects that are
reused verbatim (and consequently, decrease the amount of time it takes
to generate many packs). But this performance comes at a cost, which is
that the preferred packfile must pace its growth with that of the entire
repository in order to maintain the utility of verbatim pack reuse.

As repositories grow beyond what we can reasonably store in a single
packfile, the utility of verbatim pack reuse diminishes. Or, at the very
least, it becomes increasingly expensive to maintain as the pack
grows larger and larger.

It would be beneficial to be able to perform this same optimization over
multiple packs, provided some modest constraints (most importantly, that
the set of packs eligible for verbatim reuse is disjoint with respect
to the objects that they contain).

If we assume that the packs which we treat as candidates for verbatim
reuse are disjoint with respect to their objects, we need to make only
modest modifications to the verbatim pack-reuse code itself. Most
notably, we need to remove the assumption that the bits in the
reachability bitmap corresponding to objects from the single reuse pack
begin at the first bit position.

Future patches will unwind these assumptions and reimplement their
existing functionality as special cases of the more general assumptions
(e.g. that reuse bits can start anywhere within the bitset, but happen
to start at 0 for all existing cases).
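
As a rough sketch of that generalization (an illustration only, not
code from this series), each reusable pack can be thought of as owning a
half-open range of bit positions, and the existing single-pack case is
just the range that starts at bit 0. The `pack_range` struct and
`find_pack_for_bit()` below are hypothetical names, and the linear scan
makes no assumption about the order in which packs are listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: the bits [bitmap_pos, bitmap_pos + bitmap_nr) of one pack. */
struct pack_range {
	uint32_t bitmap_pos;
	uint32_t bitmap_nr;
};

/*
 * Return the index of the pack whose range contains bit `pos`, or -1
 * if no pack claims that bit.
 */
static int find_pack_for_bit(const struct pack_range *ranges, size_t nr,
			     uint32_t pos)
{
	size_t i;

	assert(ranges != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		/* Subtract before comparing to avoid overflow in pos + nr. */
		if (pos >= ranges[i].bitmap_pos &&
		    pos - ranges[i].bitmap_pos < ranges[i].bitmap_nr)
			return (int)i;
	}
	return -1;
}
```

With a single range `{ 0, N }` this degenerates to the current
behavior, where every bit below N belongs to the one reuse pack.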

This patch does not yet relax any of those assumptions. Instead, it
implements a foundational data-structure, the "Disjoint Packs" (`DISP`)
chunk of the multi-pack index. The `DISP` chunk's contents are described
in detail here. Importantly, the `DISP` chunk contains information to
map regions of a multi-pack index's reachability bitmap to the packs
whose objects they represent.
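
For readers who want a concrete picture of the on-disk layout, the
following is a minimal standalone sketch of decoding one table entry,
based on the format text added to gitformat-pack.txt in this patch
(three 4-byte unsigned integers in network order per pack). It is not
the reader implemented in midx.c; `struct disp_entry`, `read_be32()`,
and `decode_disp_entry()` are hypothetical names, and the sketch masks
the least-significant bit of the "meta" word as the documentation
describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one DISP table entry. */
struct disp_entry {
	uint32_t bitmap_pos; /* first bit occupied by this pack */
	uint32_t bitmap_nr;  /* number of bits belonging to this pack */
	unsigned disjoint;   /* least-significant bit of the "meta" word */
};

/* Read a 32-bit big-endian (network order) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode the entry for pack number `pack_id` from a raw DISP chunk.
 * Returns 0 on success, or -1 if the chunk has no entry for that pack.
 */
static int decode_disp_entry(const unsigned char *chunk, size_t chunk_len,
			     uint32_t pack_id, struct disp_entry *out)
{
	const size_t entry_size = 3 * sizeof(uint32_t);
	size_t off;

	assert(chunk_len % entry_size == 0);
	if (pack_id >= chunk_len / entry_size)
		return -1;

	off = (size_t)pack_id * entry_size;
	out->bitmap_pos = read_be32(chunk + off);
	out->bitmap_nr = read_be32(chunk + off + 4);
	out->disjoint = read_be32(chunk + off + 8) & 1;
	return 0;
}
```

The bounds check matters because the chunk length comes from the file
and should not be trusted to cover every pack named in the MIDX.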

For now, this chunk is only written, not read (outside of the test-tool
used in this patch to test the new chunk's behavior). Future patches
will begin to make use of this new chunk.
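
The writing side can be sketched in the same spirit. The example below
is an assumption-laden illustration rather than the writer in midx.c: it
assumes the packs are passed in pseudo-pack order, so that each pack's
starting bit is simply the running total of the object counts before it.
The real chunk lists entries in `PNAM` order, which this sketch does not
model. `write_be32()` and `encode_disp_chunk()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit integer in big-endian (network) byte order. */
static void write_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/*
 * Fill `out` (which must hold 12 * nr bytes) with one entry per pack.
 * counts[i] is the number of objects selected from pack i; a nonzero
 * disjoint[i] marks pack i as disjoint.
 */
static void encode_disp_chunk(unsigned char *out, const uint32_t *counts,
			      const unsigned char *disjoint, size_t nr)
{
	uint32_t pos = 0;
	size_t i;

	assert(out != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		write_be32(out + 12 * i, pos);           /* bitmap_pos */
		write_be32(out + 12 * i + 4, counts[i]); /* bitmap_nr */
		write_be32(out + 12 * i + 8, disjoint[i] ? 1 : 0);
		pos += counts[i];
	}
}
```

For three disjoint packs holding 10, 15, and 20 objects, this yields
starting positions 0, 10, and 25.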

This patch implements reading (though no callers outside of the above
one perform any reading) and writing this new chunk. It also extends the
`--stdin-packs` format used by the `git multi-pack-index write` builtin
to be able to designate that a given pack is to be marked as "disjoint"
by prefixing it with a '+' character.

[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
  pack that wasn't used verbatim.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt |   4 +
 Documentation/gitformat-pack.txt       | 109 +++++++++++++++++++++++
 builtin/multi-pack-index.c             |  10 ++-
 midx.c                                 | 116 ++++++++++++++++++++++---
 midx.h                                 |   5 ++
 pack-bitmap.h                          |   9 ++
 t/helper/test-read-midx.c              |  31 ++++++-
 t/t5319-multi-pack-index.sh            |  58 +++++++++++++
 8 files changed, 325 insertions(+), 17 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 3696506eb3..d130e65b28 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -49,6 +49,10 @@ write::
 	--stdin-packs::
 		Write a multi-pack index containing only the set of
 		line-delimited pack index basenames provided over stdin.
+		Lines beginning with a '+' character (followed by the
+		pack index basename as before) have their pack marked as
+		"disjoint". See the "`DISP` chunk and disjoint packs"
+		section in linkgit:gitformat-pack[5] for more.
 
 	--refs-snapshot=<path>::
 		With `--bitmap`, optionally specify a file which
diff --git a/Documentation/gitformat-pack.txt b/Documentation/gitformat-pack.txt
index 9fcb29a9c8..658682ddd5 100644
--- a/Documentation/gitformat-pack.txt
+++ b/Documentation/gitformat-pack.txt
@@ -396,6 +396,22 @@ CHUNK DATA:
 	    is padded at the end with between 0 and 3 NUL bytes to make the
 	    chunk size a multiple of 4 bytes.
 
+	Disjoint Packfiles (ID: {'D', 'I', 'S', 'P'})
+	    Stores a table of three 4-byte unsigned integers in network order.
+	    Each table entry corresponds to a single pack (in the order that
+	    they appear above in the `PNAM` chunk). The values for each table
+	    entry are as follows:
+	    - The first bit position (in pseudo-pack order, see below) to
+	      contain an object from that pack.
+	    - The number of bits whose objects are selected from that pack.
+	    - A "meta" value, whose least-significant bit indicates whether or
+	      not the pack is disjoint with respect to other packs. The
+	      remaining bits are unused.
+	    Two packs are "disjoint" with respect to one another when they have
+	    disjoint sets of objects. In other words, any object found in a pack
+	    contained in the set of disjoint packfiles is guaranteed to be
+	    uniquely located among those packs.
+
 	OID Fanout (ID: {'O', 'I', 'D', 'F'})
 	    The ith entry, F[i], stores the number of OIDs with first
 	    byte at most i. Thus F[255] stores the total
@@ -509,6 +525,99 @@ packs arranged in MIDX order (with the preferred pack coming first).
 The MIDX's reverse index is stored in the optional 'RIDX' chunk within
 the MIDX itself.
 
+=== `DISP` chunk and disjoint packs
+
+The Disjoint Packfiles (`DISP`) chunk encodes additional information
+about the objects in the multi-pack index's reachability bitmap. Recall
+that objects from the MIDX are arranged in "pseudo-pack" order (see:
+above) for reachability bitmaps.
+
+From the example above, suppose we have packs "a", "b", and "c", with
+10, 15, and 20 objects, respectively. In pseudo-pack order, those would
+be arranged as follows:
+
+    |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
+
+When working with single-pack bitmaps (or, equivalently, multi-pack
+reachability bitmaps without any packs marked as disjoint),
+linkgit:git-pack-objects[1] performs ``verbatim'' reuse, attempting to
+reuse chunks of the existing packfile instead of adding objects to the
+packing list.
+
+When a chunk of bytes is reused from an existing pack, any objects
+contained therein do not need to be added to the packing list, saving
+memory and CPU time. But a chunk from an existing packfile can only be
+reused when the following conditions are met:
+
+  - The chunk contains only objects which were requested by the caller
+    (i.e. does not contain any objects which the caller didn't ask for
+    explicitly or implicitly).
+
+  - All objects stored as offset- or reference-deltas also include their
+    base object in the resulting pack.
+
+Additionally, packfiles may not contain more than one copy of any given
+object. This introduces an additional constraint over the set of packs
+we may want to reuse. The most straightforward approach is to mandate
+that the set of packs is disjoint with respect to the set of objects
+contained in each pack. In other words, for each object `o` in the union
+of all objects stored by the disjoint set of packs, `o` is contained in
+exactly one pack from the disjoint set.
+
+One alternative design choice for multi-pack reuse might instead involve
+imposing a chunk-level constraint that allows packs in the reusable set
+to contain multiple copies across different packs, but restricts each
+chunk against including more than one copy of such an object. This is in
+theory possible to implement, but significantly more complicated than
+forcing packs themselves to be disjoint. Most notably, we would have to
+keep track of which objects have already been sent during verbatim
+pack-reuse, defeating the main purpose of verbatim pack reuse (that we
+don't have to keep track of individual objects).
+
+The `DISP` chunk encodes the necessary information in order to implement
+multi-pack reuse over a disjoint set of packs as described above.
+Specifically, the `DISP` chunk encodes three pieces of information (all
+32-bit unsigned integers in network byte-order) for each packfile `p`
+that is stored in the MIDX, as follows:
+
+`bitmap_pos`:: The first bit position (in pseudo-pack order) in the
+  multi-pack index's reachability bitmap occupied by an object from `p`.
+
+`bitmap_nr`:: The number of bit positions (including the one at
+  `bitmap_pos`) that encode objects from that pack `p`.
+
+`disjoint`:: Metadata, including whether or not the pack `p` is
+  ``disjoint''. The least significant bit stores whether or not the pack
+  is disjoint. The remaining bits are reserved for future use.
+
+For example, the `DISP` chunk corresponding to the above example (with
+packs ``a'', ``b'', and ``c'') would look like:
+
+[cols="1,2,2,2"]
+|===
+| |`bitmap_pos` |`bitmap_nr` |`disjoint`
+
+|packfile ``a''
+|`0`
+|`10`
+|`0x1`
+
+|packfile ``b''
+|`10`
+|`15`
+|`0x1`
+
+|packfile ``c''
+|`25`
+|`20`
+|`0x1`
+|===
+
+With these constraints and information in place, we can treat each
+packfile marked as disjoint as individually reusable in the same fashion
+as verbatim pack reuse is performed on individual packs prior to the
+implementation of the `DISP` chunk.
+
 == cruft packs
 
 The cruft packs feature offer an alternative to Git's traditional mechanism of
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index a72aebecaa..0f1dd4651d 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -106,11 +106,17 @@ static int git_multi_pack_index_write_config(const char *var, const char *value,
 	return 0;
 }
 
+#define DISJOINT ((void*)(uintptr_t)1)
+
 static void read_packs_from_stdin(struct string_list *to)
 {
 	struct strbuf buf = STRBUF_INIT;
-	while (strbuf_getline(&buf, stdin) != EOF)
-		string_list_append(to, buf.buf);
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (*buf.buf == '+')
+			string_list_append(to, buf.buf + 1)->util = DISJOINT;
+		else
+			string_list_append(to, buf.buf);
+	}
 	string_list_sort(to);
 
 	strbuf_release(&buf);
diff --git a/midx.c b/midx.c
index 591b3c636e..f55020072f 100644
--- a/midx.c
+++ b/midx.c
@@ -33,6 +33,7 @@
 
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_DISJOINTPACKS 0x44495350 /* "DISP" */
 #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
@@ -182,6 +183,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
 		   &m->chunk_large_offsets_len);
+	pair_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
+		   (const unsigned char **)&m->chunk_disjoint_packs,
+		   &m->chunk_disjoint_packs_len);
 
 	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
 		pair_chunk(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex,
@@ -275,6 +279,23 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t
 	return 0;
 }
 
+int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
+		       struct bitmapped_pack *bp, uint32_t pack_int_id)
+{
+	if (!m->chunk_disjoint_packs)
+		return error(_("MIDX does not contain the DISP chunk"));
+
+	if (prepare_midx_pack(r, m, pack_int_id))
+		return error(_("could not load disjoint pack %"PRIu32), pack_int_id);
+
+	bp->p = m->packs[pack_int_id];
+	bp->bitmap_pos = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id);
+	bp->bitmap_nr = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 1);
+	bp->disjoint = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 2) & 0x1;
+
+	return 0;
+}
+
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result)
 {
 	return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
@@ -457,11 +478,18 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+#define BITMAP_POS_UNKNOWN (~((uint32_t)0))
+
 struct pack_info {
 	uint32_t orig_pack_int_id;
 	char *pack_name;
 	struct packed_git *p;
-	unsigned expired : 1;
+
+	uint32_t bitmap_pos;
+	uint32_t bitmap_nr;
+
+	unsigned expired : 1,
+		 disjoint : 1;
 };
 
 static void fill_pack_info(struct pack_info *info,
@@ -473,6 +501,7 @@ static void fill_pack_info(struct pack_info *info,
 	info->orig_pack_int_id = orig_pack_int_id;
 	info->pack_name = pack_name;
 	info->p = p;
+	info->bitmap_pos = BITMAP_POS_UNKNOWN;
 }
 
 static int pack_info_compare(const void *_a, const void *_b)
@@ -516,6 +545,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 {
 	struct write_midx_context *ctx = data;
 	struct packed_git *p;
+	struct string_list_item *item = NULL;
 
 	if (ends_with(file_name, ".idx")) {
 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
@@ -534,11 +564,13 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		 * should be performed independently (likely checking
 		 * to_include before the existing MIDX).
 		 */
-		if (ctx->m && midx_contains_pack(ctx->m, file_name))
-			return;
-		else if (ctx->to_include &&
-			 !string_list_has_string(ctx->to_include, file_name))
+		if (ctx->m && midx_contains_pack(ctx->m, file_name)) {
 			return;
+		} else if (ctx->to_include) {
+			item = string_list_lookup(ctx->to_include, file_name);
+			if (!item)
+				return;
+		}
 
 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
@@ -559,6 +591,8 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 		fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
 			       ctx->nr);
+		if (item)
+			ctx->info[ctx->nr].disjoint = !!item->util;
 		ctx->nr++;
 	}
 }
@@ -568,7 +602,8 @@ struct pack_midx_entry {
 	uint32_t pack_int_id;
 	time_t pack_mtime;
 	uint64_t offset;
-	unsigned preferred : 1;
+	unsigned preferred : 1,
+		 disjoint : 1;
 };
 
 static int midx_oid_compare(const void *_a, const void *_b)
@@ -586,6 +621,12 @@ static int midx_oid_compare(const void *_a, const void *_b)
 	if (a->preferred < b->preferred)
 		return 1;
 
+	/* Sort objects in a disjoint pack last when multiple copies exist. */
+	if (a->disjoint < b->disjoint)
+		return -1;
+	if (a->disjoint > b->disjoint)
+		return 1;
+
 	if (a->pack_mtime > b->pack_mtime)
 		return -1;
 	else if (a->pack_mtime < b->pack_mtime)
@@ -671,6 +712,7 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
 					   &fanout->entries[fanout->nr],
 					   cur_object);
 		fanout->entries[fanout->nr].preferred = 0;
+		fanout->entries[fanout->nr].disjoint = 0;
 		fanout->nr++;
 	}
 }
@@ -696,6 +738,7 @@ static void midx_fanout_add_pack_fanout(struct midx_fanout *fanout,
 				cur_object,
 				&fanout->entries[fanout->nr],
 				preferred);
+		fanout->entries[fanout->nr].disjoint = info->disjoint;
 		fanout->nr++;
 	}
 }
@@ -764,14 +807,22 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 		 * Take only the first duplicate.
 		 */
 		for (cur_object = 0; cur_object < fanout.nr; cur_object++) {
-			if (cur_object && oideq(&fanout.entries[cur_object - 1].oid,
-						&fanout.entries[cur_object].oid))
-				continue;
+			struct pack_midx_entry *ours = &fanout.entries[cur_object];
+			if (cur_object) {
+				struct pack_midx_entry *prev = &fanout.entries[cur_object - 1];
+				if (oideq(&prev->oid, &ours->oid)) {
+					if (prev->disjoint && ours->disjoint)
+						die(_("duplicate object '%s' among disjoint packs '%s', '%s'"),
+						    oid_to_hex(&prev->oid),
+						    info[prev->pack_int_id].pack_name,
+						    info[ours->pack_int_id].pack_name);
+					continue;
+				}
+			}
 
 			ALLOC_GROW(deduplicated_entries, st_add(*nr_objects, 1),
 				   alloc_objects);
-			memcpy(&deduplicated_entries[*nr_objects],
-			       &fanout.entries[cur_object],
+			memcpy(&deduplicated_entries[*nr_objects], ours,
 			       sizeof(struct pack_midx_entry));
 			(*nr_objects)++;
 		}
@@ -814,6 +865,27 @@ static int write_midx_pack_names(struct hashfile *f, void *data)
 	return 0;
 }
 
+static int write_midx_disjoint_packs(struct hashfile *f, void *data)
+{
+	struct write_midx_context *ctx = data;
+	size_t i;
+
+	for (i = 0; i < ctx->nr; i++) {
+		struct pack_info *pack = &ctx->info[i];
+		if (pack->expired)
+			continue;
+
+		if (pack->bitmap_pos == BITMAP_POS_UNKNOWN && pack->bitmap_nr)
+			BUG("pack '%s' has no bitmap position, but has %"PRIu32" bitmapped object(s)",
+			    pack->pack_name, pack->bitmap_nr);
+
+		hashwrite_be32(f, pack->bitmap_pos);
+		hashwrite_be32(f, pack->bitmap_nr);
+		hashwrite_be32(f, !!pack->disjoint);
+	}
+	return 0;
+}
+
 static int write_midx_oid_fanout(struct hashfile *f,
 				 void *data)
 {
@@ -981,8 +1053,19 @@ static uint32_t *midx_pack_order(struct write_midx_context *ctx)
 	QSORT(data, ctx->entries_nr, midx_pack_order_cmp);
 
 	ALLOC_ARRAY(pack_order, ctx->entries_nr);
-	for (i = 0; i < ctx->entries_nr; i++)
+	for (i = 0; i < ctx->entries_nr; i++) {
+		struct pack_midx_entry *e = &ctx->entries[data[i].nr];
+		struct pack_info *pack = &ctx->info[ctx->pack_perm[e->pack_int_id]];
+		if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
+			pack->bitmap_pos = i;
+		pack->bitmap_nr++;
 		pack_order[i] = data[i].nr;
+	}
+	for (i = 0; i < ctx->nr; i++) {
+		struct pack_info *pack = &ctx->info[ctx->pack_perm[i]];
+		if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
+			pack->bitmap_pos = 0;
+	}
 	free(data);
 
 	trace2_region_leave("midx", "midx_pack_order", the_repository);
@@ -1283,6 +1366,7 @@ static int write_midx_internal(const char *object_dir,
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
+	int pack_disjoint_concat_len = 0;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -1495,8 +1579,10 @@ static int write_midx_internal(const char *object_dir,
 	}
 
 	for (i = 0; i < ctx.nr; i++) {
-		if (!ctx.info[i].expired)
-			pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
+		if (ctx.info[i].expired)
+			continue;
+		pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
+		pack_disjoint_concat_len += 3 * sizeof(uint32_t);
 	}
 
 	/* Check that the preferred pack wasn't expired (if given). */
@@ -1556,6 +1642,8 @@ static int write_midx_internal(const char *object_dir,
 		add_chunk(cf, MIDX_CHUNKID_REVINDEX,
 			  st_mult(ctx.entries_nr, sizeof(uint32_t)),
 			  write_midx_revindex);
+		add_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
+			  pack_disjoint_concat_len, write_midx_disjoint_packs);
 	}
 
 	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
diff --git a/midx.h b/midx.h
index a5d98919c8..cdd16a8378 100644
--- a/midx.h
+++ b/midx.h
@@ -7,6 +7,7 @@
 struct object_id;
 struct pack_entry;
 struct repository;
+struct bitmapped_pack;
 
 #define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
 #define GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP \
@@ -33,6 +34,8 @@ struct multi_pack_index {
 
 	const unsigned char *chunk_pack_names;
 	size_t chunk_pack_names_len;
+	const uint32_t *chunk_disjoint_packs;
+	size_t chunk_disjoint_packs_len;
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_object_offsets;
@@ -58,6 +61,8 @@ void get_midx_rev_filename(struct strbuf *out, struct multi_pack_index *m);
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
+int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
+		       struct bitmapped_pack *bp, uint32_t pack_int_id);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
 off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos);
 uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 5273a6a019..b7fa1a42a9 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -52,6 +52,15 @@ typedef int (*show_reachable_fn)(
 
 struct bitmap_index;
 
+struct bitmapped_pack {
+	struct packed_git *p;
+
+	uint32_t bitmap_pos;
+	uint32_t bitmap_nr;
+
+	unsigned disjoint : 1;
+};
+
 struct bitmap_index *prepare_bitmap_git(struct repository *r);
 struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx);
 void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index e9a444ddba..4b44995dca 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -100,10 +100,37 @@ static int read_midx_preferred_pack(const char *object_dir)
 	return 0;
 }
 
+static int read_midx_bitmapped_packs(const char *object_dir)
+{
+	struct multi_pack_index *midx = NULL;
+	struct bitmapped_pack pack;
+	uint32_t i;
+
+	setup_git_directory();
+
+	midx = load_multi_pack_index(object_dir, 1);
+	if (!midx)
+		return 1;
+
+	for (i = 0; i < midx->num_packs; i++) {
+		if (nth_bitmapped_pack(the_repository, midx, &pack, i) < 0)
+			return 1;
+
+		printf("%s\n", pack_basename(pack.p));
+		printf("  bitmap_pos: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_pos);
+		printf("  bitmap_nr: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_nr);
+		printf("  disjoint: %s\n", pack.disjoint & 0x1 ? "yes" : "no");
+	}
+
+	close_midx(midx);
+
+	return 0;
+}
+
 int cmd__read_midx(int argc, const char **argv)
 {
 	if (!(argc == 2 || argc == 3))
-		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
+		usage("read-midx [--show-objects|--checksum|--preferred-pack|--bitmap] <object-dir>");
 
 	if (!strcmp(argv[1], "--show-objects"))
 		return read_midx_file(argv[2], 1);
@@ -111,5 +138,7 @@ int cmd__read_midx(int argc, const char **argv)
 		return read_midx_checksum(argv[2]);
 	else if (!strcmp(argv[1], "--preferred-pack"))
 		return read_midx_preferred_pack(argv[2]);
+	else if (!strcmp(argv[1], "--bitmap"))
+		return read_midx_bitmapped_packs(argv[2]);
 	return read_midx_file(argv[1], 0);
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index c4c6060cee..fd24e0c952 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1157,4 +1157,62 @@ test_expect_success 'reader notices too-small revindex chunk' '
 	test_cmp expect.err err
 '
 
+test_expect_success 'disjoint packs are stored via the DISP chunk' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		for i in 1 2 3 4 5
+		do
+			test_commit "$i" &&
+			git repack -d || return 1
+		done &&
+
+		find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename | sort >packs &&
+
+		git multi-pack-index write --stdin-packs <packs &&
+		test_must_fail test-tool read-midx --bitmap $objdir 2>err &&
+		cat >expect <<-\EOF &&
+		error: MIDX does not contain the DISP chunk
+		EOF
+		test_cmp expect err &&
+
+		sed -e "s/^/+/g" packs >in &&
+		git multi-pack-index write --stdin-packs --bitmap \
+			--preferred-pack="$(head -n1 <packs)" <in &&
+		test-tool read-midx --bitmap $objdir >actual &&
+		for i in $(test_seq $(wc -l <packs))
+		do
+			sed -ne "${i}s/\.idx$/\.pack/p" packs &&
+			echo "  bitmap_pos: $(( $(( $i - 1 )) * 3 ))" &&
+			echo "  bitmap_nr: 3" &&
+			echo "  disjoint: yes" || return 1
+		done >expect &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'non-disjoint packs are detected' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git repack -d &&
+		test_commit other &&
+		git repack -a &&
+
+		ls -la .git/objects/pack/ &&
+
+		find $objdir/pack -type f -name "*.idx" |
+			sed -e "s/.*\/\(.*\)$/+\1/g" >in &&
+
+		test_must_fail git multi-pack-index write --stdin-packs \
+			--bitmap <in 2>err &&
+		grep "duplicate object.* among disjoint packs" err
+	)
+'
+
 test_done
-- 
2.43.0.24.g980b318f98



* [PATCH 06/24] midx: implement `midx_locate_pack()`
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (4 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 05/24] midx: implement `DISP` chunk Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 07/24] midx: implement `--retain-disjoint` mode Taylor Blau
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The multi-pack index API exposes a `midx_contains_pack()` function that
takes in a string ending in either ".idx" or ".pack" and returns whether
or not the MIDX contains a given pack corresponding to that string.

There is no corresponding function to locate the position of a pack
within the MIDX's pack order (sorted lexically by pack filename).

We could add an optional out parameter to `midx_contains_pack()` that is
filled out with the pack's position when the parameter is non-NULL, but
doing so would require touching every existing caller. To minimize that
fallout, instead introduce a new function by renaming
`midx_contains_pack()` to `midx_locate_pack()`, adding that output
parameter, and then reimplementing `midx_contains_pack()` in terms of
it.

Future patches will make use of this new function.
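
As a rough illustration of the pattern (a locate function with an
optional out parameter, and a contains function written in terms of
it), here is a minimal standalone sketch. The `name_table` struct and
the `table_locate()` / `table_contains()` names are hypothetical
stand-ins, not Git's API, and plain strcmp() is assumed in place of the
".idx"/".pack"-aware comparison that the MIDX code uses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the MIDX's sorted table of pack names;
 * this is not Git's real data structure.
 */
struct name_table {
	const char **names;	/* sorted according to strcmp() */
	uint32_t nr;
};

/*
 * Binary search for 'name'. Returns 1 if it is found, storing its
 * index in '*pos' when 'pos' is non-NULL. Returns 0 otherwise and
 * leaves '*pos' untouched.
 */
static int table_locate(const struct name_table *t, const char *name,
			uint32_t *pos)
{
	uint32_t first = 0, last;

	assert(t && name);
	last = t->nr;

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = strcmp(name, t->names[mid]);

		if (!cmp) {
			if (pos)
				*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}
	return 0;
}

/* Callers that only need a yes/no answer keep their old signature. */
static int table_contains(const struct name_table *t, const char *name)
{
	return table_locate(t, name, NULL);
}
```

Existing callers of the contains-style function are unaffected, while
new callers that need the position can pass a non-NULL `pos`.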

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 13 +++++++++++--
 midx.h |  5 ++++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index f55020072f..65ba0c70fe 100644
--- a/midx.c
+++ b/midx.c
@@ -413,7 +413,8 @@ static int cmp_idx_or_pack_name(const char *idx_or_pack_name,
 	return strcmp(idx_or_pack_name, idx_name);
 }
 
-int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
+int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
+		     uint32_t *pos)
 {
 	uint32_t first = 0, last = m->num_packs;
 
@@ -424,8 +425,11 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 
 		current = m->pack_names[mid];
 		cmp = cmp_idx_or_pack_name(idx_or_pack_name, current);
-		if (!cmp)
+		if (!cmp) {
+			if (pos)
+				*pos = mid;
 			return 1;
+		}
 		if (cmp > 0) {
 			first = mid + 1;
 			continue;
@@ -436,6 +440,11 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 	return 0;
 }
 
+int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
+{
+	return midx_locate_pack(m, idx_or_pack_name, NULL);
+}
+
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local)
 {
 	struct multi_pack_index *m;
diff --git a/midx.h b/midx.h
index cdd16a8378..a6e969c2ea 100644
--- a/midx.h
+++ b/midx.h
@@ -70,7 +70,10 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
 					struct multi_pack_index *m,
 					uint32_t n);
 int fill_midx_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
-int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name);
+int midx_contains_pack(struct multi_pack_index *m,
+		       const char *idx_or_pack_name);
+int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
+		     uint32_t *pos);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
 /*
-- 
2.43.0.24.g980b318f98



* [PATCH 07/24] midx: implement `--retain-disjoint` mode
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (5 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 06/24] midx: implement `midx_locate_pack()` Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-30 10:18   ` Patrick Steinhardt
  2023-11-28 19:08 ` [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode Taylor Blau
                   ` (19 subsequent siblings)
  26 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Once multi-pack reachability bitmaps learn how to perform pack reuse
over the set of disjoint packs, we will want to teach `git repack` to
evolve the set of disjoint packs over time.

To evolve the set of disjoint packs means any new packs made by `repack`
should be disjoint with respect to the existing set of disjoint packs so
as to be able to join that set when updating the multi-pack index.

The details of generating such packs will be left to future commits. But
any new pack(s) created by repack as disjoint will be marked as such by
passing them over `--stdin-packs` with the special '+' marker when
generating a new MIDX.

This patch, however, addresses the question of how we retain the
existing set of disjoint packs when updating the multi-pack index. One
option would be for `repack` to keep track of the set of disjoint packs
itself by querying the MIDX, and then adding the special '+' marker
appropriately when generating the input for `--stdin-packs`.

But this is verbose and error-prone, since two different parts of Git
would need to maintain the same notion of the set of disjoint packs.
When one disagrees with the other, the set of so-called disjoint packs
may actually contain two or more packs which have one or more object(s)
in common, making the set non-disjoint.

Instead, introduce a `--retain-disjoint` mode for the `git
multi-pack-index write` sub-command which keeps the disjoint marking on
any packs which are:

  - marked as disjoint in the existing MIDX, and

  - not deleted (e.g., they are not excluded from the input for
    `--stdin-packs`).

This will allow the `repack` command to not have to keep track of the
set of currently-disjoint packs itself, reducing the number of lines of
code necessary to implement this feature, and making the resulting
implementation less error-prone.
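
The retention rule described above can be sketched in isolation as
follows. This is only an illustration: `old_pack`, `new_pack`, and
`retain_disjoint()` are hypothetical names, the linear scan stands in
for the `midx_locate_pack()` lookup, and the sketch ORs the old bit
into the new one (so a '+' marker is never cleared), which is an
assumption about the intended semantics rather than a copy of what the
patch does:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified records; not Git's real structures. */
struct old_pack {
	const char *name;
	unsigned disjoint : 1;	/* as recorded in the existing MIDX */
};

struct new_pack {
	const char *name;
	unsigned disjoint : 1;	/* already set if listed with '+' */
};

/*
 * For every pack that will be written to the new MIDX, carry over the
 * disjoint bit from the existing MIDX if the pack appears there.
 * Packs that were in the old MIDX but are absent from 'new_packs'
 * (i.e., deleted ones) are never visited, so they are ignored.
 */
static void retain_disjoint(const struct old_pack *old, size_t old_nr,
			    struct new_pack *new_packs, size_t new_nr)
{
	size_t i, j;

	assert(old || !old_nr);
	assert(new_packs || !new_nr);

	for (i = 0; i < new_nr; i++) {
		for (j = 0; j < old_nr; j++) {
			if (strcmp(new_packs[i].name, old[j].name))
				continue;
			new_packs[i].disjoint |= old[j].disjoint;
			break;
		}
	}
}
```

With this shape, the caller only has to mark the packs it newly
created; everything else is derived from the existing MIDX.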

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt |  8 +++
 builtin/multi-pack-index.c             |  3 +
 midx.c                                 | 49 +++++++++++++++
 midx.h                                 |  1 +
 t/lib-disjoint.sh                      | 38 ++++++++++++
 t/t5319-multi-pack-index.sh            | 82 ++++++++++++++++++++++++++
 6 files changed, 181 insertions(+)
 create mode 100644 t/lib-disjoint.sh

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index d130e65b28..ac0c7b124b 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -54,6 +54,14 @@ write::
 		"disjoint". See the "`DISP` chunk and disjoint packs"
 		section in linkgit:gitformat-pack[5] for more.
 
+	--retain-disjoint::
+		When writing a multi-pack index with a reachability
+		bitmap, keep any packs marked as disjoint in the
+		existing MIDX (if any) as such in the new MIDX. Existing
+		disjoint packs which are removed (e.g., not listed via
+		`--stdin-packs`) are ignored. This option works in
+		addition to the '+' marker for `--stdin-packs`.
+
 	--refs-snapshot=<path>::
 		With `--bitmap`, optionally specify a file which
 		contains a "refs snapshot" taken prior to repacking.
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 0f1dd4651d..dcfabf2626 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -138,6 +138,9 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
 			 N_("write multi-pack index containing only given indexes")),
 		OPT_FILENAME(0, "refs-snapshot", &opts.refs_snapshot,
 			     N_("refs snapshot for selecting bitmap commits")),
+		OPT_BIT(0, "retain-disjoint", &opts.flags,
+			N_("retain non-deleted disjoint packs"),
+			MIDX_WRITE_RETAIN_DISJOINT),
 		OPT_END(),
 	};
 
diff --git a/midx.c b/midx.c
index 65ba0c70fe..ce67da9f85 100644
--- a/midx.c
+++ b/midx.c
@@ -721,6 +721,12 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
 					   &fanout->entries[fanout->nr],
 					   cur_object);
 		fanout->entries[fanout->nr].preferred = 0;
+		/*
+		 * It's OK to set disjoint to 0 here, even with
+		 * `--retain-disjoint`, since we will always see the disjoint
+		 * copy of some object below in get_sorted_entries(), causing us
+		 * to die().
+		 */
 		fanout->entries[fanout->nr].disjoint = 0;
 		fanout->nr++;
 	}
@@ -1362,6 +1368,37 @@ static struct multi_pack_index *lookup_multi_pack_index(struct repository *r,
 	return result;
 }
 
+static int midx_retain_existing_disjoint(struct repository *r,
+					 struct multi_pack_index *from,
+					 struct write_midx_context *ctx)
+{
+	struct bitmapped_pack bp;
+	uint32_t i, midx_pos;
+
+	for (i = 0; i < ctx->nr; i++) {
+		struct pack_info *info = &ctx->info[i];
+		/*
+		 * Having to call `midx_locate_pack()` in a loop is
+		 * sub-optimal, since it is O(n*log(n)) in the number
+		 * of packs.
+		 *
+		 * When reusing an existing MIDX, we know that the first
+		 * 'n' packs appear in the same order, so we could avoid
+		 * this when reusing an existing MIDX. But we may be
+		 * instead relying on the order given to us by
+		 * for_each_file_in_pack_dir(), in which case we can't
+		 * make any such guarantees.
+		 */
+		if (!midx_locate_pack(from, info->pack_name, &midx_pos))
+			continue;
+
+		if (nth_bitmapped_pack(r, from, &bp, midx_pos) < 0)
+			return -1;
+		info->disjoint = bp.disjoint;
+	}
+	return 0;
+}
+
 static int write_midx_internal(const char *object_dir,
 			       struct string_list *packs_to_include,
 			       struct string_list *packs_to_drop,
@@ -1444,6 +1481,18 @@ static int write_midx_internal(const char *object_dir,
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
 	stop_progress(&ctx.progress);
 
+	if (flags & MIDX_WRITE_RETAIN_DISJOINT) {
+		struct multi_pack_index *m = ctx.m;
+		if (!m)
+			m = lookup_multi_pack_index(the_repository, object_dir);
+
+		if (m) {
+			result = midx_retain_existing_disjoint(the_repository, m, &ctx);
+			if (result)
+				goto cleanup;
+		}
+	}
+
 	if ((ctx.m && ctx.nr == ctx.m->num_packs) &&
 	    !(packs_to_include || packs_to_drop)) {
 		struct bitmap_index *bitmap_git;
diff --git a/midx.h b/midx.h
index a6e969c2ea..d7ce52ff7b 100644
--- a/midx.h
+++ b/midx.h
@@ -54,6 +54,7 @@ struct multi_pack_index {
 #define MIDX_WRITE_BITMAP (1 << 2)
 #define MIDX_WRITE_BITMAP_HASH_CACHE (1 << 3)
 #define MIDX_WRITE_BITMAP_LOOKUP_TABLE (1 << 4)
+#define MIDX_WRITE_RETAIN_DISJOINT (1 << 5)
 
 const unsigned char *get_midx_checksum(struct multi_pack_index *m);
 void get_midx_filename(struct strbuf *out, const char *object_dir);
diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
new file mode 100644
index 0000000000..c6c6e74aba
--- /dev/null
+++ b/t/lib-disjoint.sh
@@ -0,0 +1,38 @@
+# Helpers for scripts testing disjoint packs; see t5319 for example usage.
+
+objdir=.git/objects
+
+test_disjoint_1 () {
+	local pack="$1"
+	local want="$2"
+
+	test-tool read-midx --bitmap $objdir >out &&
+	grep -A 3 "$pack" out >found &&
+
+	if ! test -s found
+	then
+		echo >&2 "could not find '$pack' in MIDX"
+		return 1
+	fi
+
+	if ! grep -q "disjoint: $want" found
+	then
+		echo >&2 "incorrect disjoint state for pack '$pack'"
+		return 1
+	fi
+	return 0
+}
+
+# test_must_be_disjoint <pack-$XYZ.pack>
+#
+# Ensures that the given pack is marked as disjoint.
+test_must_be_disjoint () {
+	test_disjoint_1 "$1" "yes"
+}
+
+# test_must_not_be_disjoint <pack-$XYZ.pack>
+#
+# Ensures that the given pack is not marked as disjoint.
+test_must_not_be_disjoint () {
+	test_disjoint_1 "$1" "no"
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index fd24e0c952..02cfddf151 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -3,6 +3,7 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 . "$TEST_DIRECTORY"/lib-chunk.sh
+. "$TEST_DIRECTORY"/lib-disjoint.sh
 
 GIT_TEST_MULTI_PACK_INDEX=0
 objdir=.git/objects
@@ -1215,4 +1216,85 @@ test_expect_success 'non-disjoint packs are detected' '
 	)
 '
 
+test_expect_success 'retain disjoint packs while writing' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		for i in 1 2
+		do
+			test_commit "$i" && git repack -d || return 1
+		done &&
+
+		find $objdir/pack -type f -name "pack-*.idx" |
+		sed -e "s/^.*\/\(.*\)/\1/g" | sort >packs.old &&
+
+		test_line_count = 2 packs.old &&
+		disjoint="$(head -n 1 packs.old)" &&
+		non_disjoint="$(tail -n 1 packs.old)" &&
+
+		cat >in <<-EOF &&
+		+$disjoint
+		$non_disjoint
+		EOF
+		git multi-pack-index write --stdin-packs --bitmap <in &&
+
+		test_must_be_disjoint "${disjoint%.idx}.pack" &&
+		test_must_not_be_disjoint "${non_disjoint%.idx}.pack" &&
+
+		test_commit 3 &&
+		git repack -d &&
+
+		find $objdir/pack -type f -name "pack-*.idx" |
+		sed -e "s/^.*\/\(.*\)/\1/g" | sort >packs.new &&
+
+		new_disjoint="$(comm -13 packs.old packs.new)" &&
+		cat >in <<-EOF &&
+		$disjoint
+		$non_disjoint
+		+$new_disjoint
+		EOF
+		git multi-pack-index write --stdin-packs --bitmap \
+			--retain-disjoint <in &&
+
+		test_must_be_disjoint "${disjoint%.idx}.pack" &&
+		test_must_be_disjoint "${new_disjoint%.idx}.pack" &&
+		test_must_not_be_disjoint "${non_disjoint%.idx}.pack"
+
+	)
+'
+
+test_expect_success 'non-disjoint packs are detected via --retain-disjoint' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+		packdir=.git/objects/pack &&
+
+		test_commit base &&
+		base="$(echo base | git pack-objects --revs $packdir/pack)" &&
+
+		cat >in <<-EOF &&
+		+pack-$base.idx
+		EOF
+		git multi-pack-index write --stdin-packs --bitmap <in &&
+
+		test_must_be_disjoint "pack-$base.pack" &&
+
+		test_commit other &&
+		other="$(echo other | git pack-objects --revs $packdir/pack)" &&
+
+		cat >in <<-EOF &&
+		pack-$base.idx
+		+pack-$other.idx
+		EOF
+		test_must_fail git multi-pack-index write --stdin-packs --retain-disjoint --bitmap <in 2>err &&
+		grep "duplicate object.* among disjoint packs" err &&
+
+		test_must_fail git multi-pack-index write --retain-disjoint --bitmap 2>err &&
+		grep "duplicate object.* among disjoint packs" err
+	)
+'
+
 test_done
-- 
2.43.0.24.g980b318f98



* [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (6 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 07/24] midx: implement `--retain-disjoint` mode Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-30 10:18   ` Patrick Steinhardt
  2023-11-28 19:08 ` [PATCH 09/24] repack: implement `--extend-disjoint` mode Taylor Blau
                   ` (18 subsequent siblings)
  26 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Before multi-pack reachability bitmaps learn how to perform pack reuse
over the set of disjoint packs, we will need a way to generate packs
that are known to be disjoint with respect to the currently marked set
of disjoint packs.

In other words, we want a way to make a pack which does not have any
objects contained in the union of the set of packs which are currently
marked as disjoint.

There are various ways that we could go about this, for example:

  - passing `--unpacked`, which would exclude all packed objects (and
    thus would not contain any objects from the disjoint pack)

  - passing `--stdin-packs` with the set of packs currently marked as
    disjoint as "excluded", indicating that `pack-objects` should
    discard any objects present in any of the excluded packs (thus
    producing a disjoint pack)

  - marking each of the disjoint packs as kept in-core with the
    `--keep-pack` flag, and then passing `--honor-pack-keep` to
    similarly ignore any object(s) from kept packs (thus also producing
    a pack which is disjoint with respect to the current set)

`git repack` is the main entry-point to generating a new pack, by
invoking `pack-objects` and then adding the new pack to the set of
disjoint packs if generating a new MIDX. However, `repack` has a number
of ways to invoke `pack-objects` (e.g., all-into-one repacks, geometric
repacks, incremental repacks, etc.), all of which would require careful
reasoning in order to prove that the resulting set of packs is disjoint.

The most appealing option of the above would be to pass the set of
disjoint packs as kept (via `--keep-pack`) and then ignore their
contents (with `--honor-pack-keep`), doing so for all kinds of
`pack-objects` invocations. But there may be more disjoint packs than we
can easily fit into the command-line arguments.

Instead, teach `pack-objects` a special `--ignore-disjoint` mode which
is the moral equivalent of marking the set of disjoint packs as kept,
and ignoring their contents, even if those objects would have otherwise
been packed. In fact, this similarity extends down to the
implementation, where each disjoint pack is first loaded, then has its
`pack_keep_in_core` bit set.
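
To make the "mark as kept, then ignore" idea concrete, here is a toy
model of it. Everything in it is an assumption made for illustration:
`toy_pack`, `mark_disjoint_as_kept()`, `in_kept_pack()`, and
`want_object()` are hypothetical names, objects are modeled as small
integers, and the lookup is a plain linear scan rather than Git's
kept-pack cache:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a pack; not Git's 'struct packed_git'. */
struct toy_pack {
	const uint32_t *objects;	/* toy object IDs in this pack */
	size_t nr;
	unsigned disjoint : 1;		/* as recorded in the MIDX */
	unsigned keep_in_core : 1;	/* set by mark_disjoint_as_kept() */
};

/* Step 1: treat every disjoint pack as if it were a kept pack. */
static void mark_disjoint_as_kept(struct toy_pack *packs, size_t nr)
{
	size_t i;

	assert(packs || !nr);
	for (i = 0; i < nr; i++)
		if (packs[i].disjoint)
			packs[i].keep_in_core = 1;
}

/* Returns 1 if 'id' is found in any pack marked as kept. */
static int in_kept_pack(const struct toy_pack *packs, size_t nr, uint32_t id)
{
	size_t i, j;

	assert(packs || !nr);
	for (i = 0; i < nr; i++) {
		if (!packs[i].keep_in_core)
			continue;
		for (j = 0; j < packs[i].nr; j++)
			if (packs[i].objects[j] == id)
				return 1;
	}
	return 0;
}

/* Step 2: an object is only packed if no kept pack already has it. */
static int want_object(const struct toy_pack *packs, size_t nr, uint32_t id)
{
	return !in_kept_pack(packs, nr, id);
}
```

Any pack built only from objects for which `want_object()` returns 1 is
disjoint with respect to the marked set by construction.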

With this in place, we can use the kept-pack cache from 20b031fede
(packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
which looks up objects first in a cache containing just the set of kept
(in this case, disjoint) packs. Assuming that the set of disjoint packs
is a relatively small portion of the entire repository (which should be
a safe assumption to make), each object lookup will be very inexpensive.

The only place we want to avoid using `--ignore-disjoint` is in
conjunction with `--cruft`, since doing so may cause us to omit an
object which would have been included in a new cruft pack in order to
freshen it. In other words, combining the two might cause that object to
be pruned from the repository earlier than expected.

Otherwise, `--ignore-disjoint` is compatible with most other modes of
`pack-objects`. These various combinations are tested below. As a
result, `repack` will be able to unconditionally (except for the cruft
pack) pass `--ignore-disjoint` when trying to add a new pack to the
disjoint set, and the result will be usable, without having to carefully
consider and reason about each individual case.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |   8 ++
 builtin/pack-objects.c             |  31 +++++-
 t/lib-disjoint.sh                  |  11 ++
 t/t5331-pack-objects-stdin.sh      | 156 +++++++++++++++++++++++++++++
 4 files changed, 203 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index e32404c6aa..592c4ce742 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -96,6 +96,14 @@ base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--ignore-disjoint::
+	This flag causes an object that appears in any pack marked as
+	"disjoint" by the multi-pack index to be ignored, even if it
+	would have otherwise been packed. When used with
+	`--stdin-packs`, objects from disjoint packs may be included if
+	and only if a disjoint pack is explicitly given as an input pack
+	to `--stdin-packs`. Incompatible with `--cruft`.
+
 --cruft::
 	Packs unreachable objects into a separate "cruft" pack, denoted
 	by the existence of a `.mtimes` file. Typically used by `git
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index bfa60359d4..107154db34 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -207,6 +207,7 @@ static int have_non_local_packs;
 static int incremental;
 static int ignore_packed_keep_on_disk;
 static int ignore_packed_keep_in_core;
+static int ignore_midx_disjoint_packs;
 static int allow_ofs_delta;
 static struct pack_idx_option pack_idx_opts;
 static const char *base_name;
@@ -1403,7 +1404,8 @@ static int want_found_object(const struct object_id *oid, int exclude,
 	/*
 	 * Then handle .keep first, as we have a fast(er) path there.
 	 */
-	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
+	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core ||
+	    ignore_midx_disjoint_packs) {
 		/*
 		 * Set the flags for the kept-pack cache to be the ones we want
 		 * to ignore.
@@ -1415,7 +1417,7 @@ static int want_found_object(const struct object_id *oid, int exclude,
 		unsigned flags = 0;
 		if (ignore_packed_keep_on_disk)
 			flags |= ON_DISK_KEEP_PACKS;
-		if (ignore_packed_keep_in_core)
+		if (ignore_packed_keep_in_core || ignore_midx_disjoint_packs)
 			flags |= IN_CORE_KEEP_PACKS;
 
 		if (ignore_packed_keep_on_disk && p->pack_keep)
@@ -3389,6 +3391,7 @@ static void read_packs_list_from_stdin(void)
 			die(_("could not find pack '%s'"), item->string);
 		if (!is_pack_valid(p))
 			die(_("packfile %s cannot be accessed"), p->pack_name);
+		p->pack_keep_in_core = 0;
 	}
 
 	/*
@@ -4266,6 +4269,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("create packs suitable for shallow fetches")),
 		OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
 			 N_("ignore packs that have companion .keep file")),
+		OPT_BOOL(0, "ignore-disjoint", &ignore_midx_disjoint_packs,
+			 N_("ignore packs that are marked disjoint in the MIDX")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("ignore this pack")),
 		OPT_INTEGER(0, "compression", &pack_compression_level,
@@ -4412,7 +4417,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		if (use_internal_rev_list)
 			die(_("cannot use internal rev list with --cruft"));
 		if (stdin_packs)
-			die(_("cannot use --stdin-packs with --cruft"));
+			die(_("cannot use %s with %s"), "--stdin-packs", "--cruft");
+		if (ignore_midx_disjoint_packs)
+			die(_("cannot use %s with %s"), "--ignore-disjoint", "--cruft");
 	}
 
 	/*
@@ -4452,6 +4459,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		if (!p) /* no keep-able packs found */
 			ignore_packed_keep_on_disk = 0;
 	}
+	if (ignore_midx_disjoint_packs) {
+		struct multi_pack_index *m = get_multi_pack_index(the_repository);
+		struct bitmapped_pack pack;
+		unsigned any_disjoint = 0;
+		uint32_t i;
+
+		for (i = 0; m && m->chunk_disjoint_packs && i < m->num_packs; i++) {
+			if (nth_bitmapped_pack(the_repository, m, &pack, i) < 0)
+				die(_("could not load bitmapped pack %i"), i);
+			if (pack.disjoint) {
+				pack.p->pack_keep_in_core = 1;
+				any_disjoint = 1;
+			}
+		}
+
+		if (!any_disjoint) /* no disjoint packs to ignore */
+			ignore_midx_disjoint_packs = 0;
+	}
 	if (local) {
 		/*
 		 * unlike ignore_packed_keep_on_disk above, we do not
diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
index c6c6e74aba..c802ca6940 100644
--- a/t/lib-disjoint.sh
+++ b/t/lib-disjoint.sh
@@ -36,3 +36,14 @@ test_must_be_disjoint () {
 test_must_not_be_disjoint () {
 	test_disjoint_1 "$1" "no"
 }
+
+# packed_contents </path/to/pack-$XYZ.idx [...]>
+#
+# Prints the set of objects packed in the given pack indexes.
+packed_contents () {
+	for idx in "$@"
+	do
+		git show-index <$idx || return 1
+	done >tmp &&
+	cut -d" " -f2 <tmp | sort -u
+}
diff --git a/t/t5331-pack-objects-stdin.sh b/t/t5331-pack-objects-stdin.sh
index 2dcf1eecee..e522aa3f7d 100755
--- a/t/t5331-pack-objects-stdin.sh
+++ b/t/t5331-pack-objects-stdin.sh
@@ -6,6 +6,7 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
 
 TEST_PASSES_SANITIZE_LEAK=true
 . ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-disjoint.sh
 
 packed_objects () {
 	git show-index <"$1" >tmp-object-list &&
@@ -237,4 +238,159 @@ test_expect_success 'pack-objects --stdin with packfiles from main and alternate
 	test_cmp expected-objects actual-objects
 '
 
+objdir=.git/objects
+packdir=$objdir/pack
+
+test_expect_success 'loose objects also in disjoint packs are ignored' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		# create a pack containing the objects in each commit below, but
+		# do not delete their loose copies
+		test_commit base &&
+		base_pack="$(echo base | git pack-objects --revs $packdir/pack)" &&
+
+		test_commit other &&
+		other_pack="$(echo base..other | git pack-objects --revs $packdir/pack)" &&
+
+		cat >in <<-EOF &&
+		pack-$base_pack.idx
+		+pack-$other_pack.idx
+		EOF
+		git multi-pack-index write --stdin-packs --bitmap <in &&
+
+		test_commit more &&
+		out="$(git pack-objects --all --ignore-disjoint $packdir/pack)" &&
+
+		# gather all objects in "all", and objects from the disjoint
+		# pack in "disjoint"
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >all &&
+		packed_contents "$packdir/pack-$other_pack.idx" >disjoint &&
+
+		# make sure that the set of objects we just generated matches
+		# "all \ disjoint"
+		packed_contents "$packdir/pack-$out.idx" >got &&
+		comm -23 all disjoint >want &&
+		test_cmp want got
+	)
+'
+
+test_expect_success 'objects in disjoint packs are ignored (--unpacked)' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		for c in A B
+		do
+			test_commit "$c" || return 1
+		done &&
+
+		A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
+		B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
+
+		cat >in <<-EOF &&
+		pack-$A.idx
+		+pack-$B.idx
+		EOF
+		git multi-pack-index write --stdin-packs --bitmap <in &&
+
+		test_must_not_be_disjoint "pack-$A.pack" &&
+		test_must_be_disjoint "pack-$B.pack" &&
+
+		test_commit C &&
+
+		got="$(git pack-objects --all --unpacked --ignore-disjoint $packdir/pack)" &&
+		packed_contents "$packdir/pack-$got.idx" >actual &&
+
+		git rev-list --objects --no-object-names B..C >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'objects in disjoint packs are ignored (--stdin-packs)' '
+	# Create objects in three separate packs:
+	#
+	#   - pack A (midx, non disjoint)
+	#   - pack B (midx, disjoint)
+	#   - pack C (non-midx)
+	#
+	# Then create a new pack with `--stdin-packs` and `--ignore-disjoint`
+	# including packs A, B, and C. The resulting pack should contain
+	# only the objects from packs A, and C, excluding those from
+	# pack B as it is marked as disjoint.
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		for c in A B C
+		do
+			test_commit "$c" || return 1
+		done &&
+
+		A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
+		B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
+		C="$(echo "B..C" | git pack-objects --revs $packdir/pack)" &&
+
+		cat >in <<-EOF &&
+		pack-$A.idx
+		+pack-$B.idx
+		EOF
+		git multi-pack-index write --stdin-packs --bitmap <in &&
+
+		test_must_not_be_disjoint "pack-$A.pack" &&
+		test_must_be_disjoint "pack-$B.pack" &&
+
+		# Generate a pack with `--stdin-packs` using packs "A" and "C",
+		# but excluding objects from "B". The objects from pack "B" are
+		# expected to be omitted from the generated pack for two
+		# reasons:
+		#
+		#   - because it was specified as a negated tip via
+		#     `--stdin-packs`
+		#   - because it is a disjoint pack.
+		cat >in <<-EOF &&
+		pack-$A.pack
+		^pack-$B.pack
+		pack-$C.pack
+		EOF
+		got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
+
+		packed_contents "$packdir/pack-$got.idx" >actual &&
+		packed_contents "$packdir/pack-$A.idx" \
+				"$packdir/pack-$C.idx" >expect &&
+		test_cmp expect actual &&
+
+		# Generate another pack with `--stdin-packs`, this time
+		# using packs "B" and "C". The objects from pack "B" are
+		# expected to be in the final pack, despite it being a
+		# disjoint pack, because "B" was mentioned explicitly
+		# via `stdin-packs`.
+		cat >in <<-EOF &&
+		pack-$B.pack
+		pack-$C.pack
+		EOF
+		got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
+
+		packed_contents "$packdir/pack-$got.idx" >actual &&
+		packed_contents "$packdir/pack-$B.idx" \
+				"$packdir/pack-$C.idx" >expect &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success '--cruft is incompatible with --ignore-disjoint' '
+	test_must_fail git pack-objects --cruft --ignore-disjoint --stdout \
+		</dev/null >/dev/null 2>actual &&
+	cat >expect <<-\EOF &&
+	fatal: cannot use --ignore-disjoint with --cruft
+	EOF
+	test_cmp expect actual
+'
+
 test_done
-- 
2.43.0.24.g980b318f98


* [PATCH 09/24] repack: implement `--extend-disjoint` mode
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (7 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-12-07 13:13   ` Patrick Steinhardt
  2023-11-28 19:08 ` [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
                   ` (17 subsequent siblings)
  26 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Now that we can generate packs which are disjoint with respect to the
set of currently-disjoint packs, implement a mode of `git repack` which
extends the set of disjoint packs with any new (non-cruft) pack(s)
generated during the repack.

The idea is mostly straightforward, with a couple of gotchas. The
straightforward part is to make sure that any new packs are disjoint
with respect to the set of currently disjoint packs which are _not_
being removed from the repository as a result of the repack.

If a pack which is currently marked as disjoint is, on the other hand,
about to be removed from the repository, it is OK (and expected) that
new pack(s) will contain some or all of its objects. Since the pack
originally marked as disjoint will be removed, it will necessarily leave
the disjoint set, making room for new packs with its same objects to
take its place. In other words, the resulting set of disjoint packs will
be disjoint with respect to one another.
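
The invariant described above can be sketched as a standalone check,
purely as an illustration under simplifying assumptions: objects are toy
integers kept in sorted arrays, and the `toy_repack_pack` struct and
`toy_*` functions are hypothetical names that do not exist in Git.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a pack as seen at the end of a repack. */
struct toy_repack_pack {
	const uint32_t *objects; /* toy object IDs, sorted ascending */
	size_t nr;               /* number of entries in `objects` */
	unsigned disjoint : 1;   /* marked (or about to be marked) disjoint */
	unsigned removed : 1;    /* deleted as a result of the repack */
};

/* Return 1 if the two sorted object lists share at least one object. */
static int toy_packs_overlap(const struct toy_repack_pack *a,
			     const struct toy_repack_pack *b)
{
	size_t i = 0, j = 0;

	while (i < a->nr && j < b->nr) {
		if (a->objects[i] == b->objects[j])
			return 1;
		if (a->objects[i] < b->objects[j])
			i++;
		else
			j++;
	}
	return 0;
}

/*
 * Return 1 if every pair of surviving disjoint packs is free of shared
 * objects. Packs which are being removed leave the disjoint set, so
 * they are skipped entirely.
 */
static int toy_disjoint_set_is_valid(const struct toy_repack_pack *packs,
				     size_t nr)
{
	size_t i, j;

	assert(packs || !nr);
	for (i = 0; i < nr; i++) {
		if (!packs[i].disjoint || packs[i].removed)
			continue;
		for (j = i + 1; j < nr; j++) {
			if (!packs[j].disjoint || packs[j].removed)
				continue;
			if (toy_packs_overlap(&packs[i], &packs[j]))
				return 0;
		}
	}
	return 1;
}
```

With this model, a new pack may duplicate every object of an old
disjoint pack as long as the old pack is flagged `removed`; if the old
pack were to survive, the same set would fail the check.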

The gotchas mostly have to do with making sure that we do not generate a
disjoint pack in the following scenarios:

  - promisor packs
  - cruft packs (which may necessarily need to include an object from a
    disjoint pack in order to freshen it in certain circumstances)
  - all-into-one repacks without '-d'
  - `--filter-to`, which conceptually could work with the new
    `--extend-disjoint` option, but only in limited circumstances

Otherwise, we mark which packs were created as disjoint by using a new
bit in the `generated_pack_data` struct, and then marking those pack(s)
as disjoint accordingly when generating the MIDX. Non-deleted packs
which are marked as disjoint are retained as such by passing the
equivalent of `--retain-disjoint` when calling the MIDX API to update
the MIDX.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt      |  12 +++
 builtin/repack.c                  |  57 +++++++++---
 t/t7700-repack.sh                 |   4 +-
 t/t7705-repack-extend-disjoint.sh | 142 ++++++++++++++++++++++++++++++
 4 files changed, 203 insertions(+), 12 deletions(-)
 create mode 100755 t/t7705-repack-extend-disjoint.sh

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index c902512a9e..50ba5e7f9c 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -249,6 +249,18 @@ linkgit:git-multi-pack-index[1]).
 	Write a multi-pack index (see linkgit:git-multi-pack-index[1])
 	containing the non-redundant packs.
 
+--extend-disjoint::
+	Extends the set of disjoint packs. All new non-cruft pack(s)
+	generated are constructed to be disjoint with respect to the set
+	of currently disjoint packs, excluding any packs that will be
+	removed as a result of the repack operation. For more on
+	disjoint packs, see the details in linkgit:gitformat-pack[5],
+	under the section "`DISP` chunk and disjoint packs".
++
+Useful only with the combination of `--write-midx` and
+`--write-bitmap-index`. Incompatible with `--filter-to`. Incompatible
+with `-A`, `-a`, or `--cruft` unless `-d` is given.
+
 CONFIGURATION
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index edaee4dbec..0601bd16c4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -58,6 +58,7 @@ struct pack_objects_args {
 	int no_reuse_object;
 	int quiet;
 	int local;
+	int ignore_disjoint;
 	struct list_objects_filter_options filter_options;
 };
 
@@ -293,6 +294,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_push(&cmd->args,  "--local");
 	if (args->quiet)
 		strvec_push(&cmd->args,  "--quiet");
+	if (args->ignore_disjoint)
+		strvec_push(&cmd->args,  "--ignore-disjoint");
 	if (delta_base_offset)
 		strvec_push(&cmd->args,  "--delta-base-offset");
 	strvec_push(&cmd->args, out);
@@ -334,9 +337,11 @@ static struct {
 
 struct generated_pack_data {
 	struct tempfile *tempfiles[ARRAY_SIZE(exts)];
+	unsigned disjoint : 1;
 };
 
-static struct generated_pack_data *populate_pack_exts(const char *name)
+static struct generated_pack_data *populate_pack_exts(const char *name,
+						      unsigned disjoint)
 {
 	struct stat statbuf;
 	struct strbuf path = STRBUF_INIT;
@@ -353,6 +358,8 @@ static struct generated_pack_data *populate_pack_exts(const char *name)
 		data->tempfiles[i] = register_tempfile(path.buf);
 	}
 
+	data->disjoint = disjoint;
+
 	strbuf_release(&path);
 	return data;
 }
@@ -379,6 +386,8 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 	prepare_pack_objects(&cmd, args, packtmp);
 	cmd.in = -1;
 
+	strvec_pushf(&cmd.args, "--no-ignore-disjoint");
+
 	/*
 	 * NEEDSWORK: Giving pack-objects only the OIDs without any ordering
 	 * hints may result in suboptimal deltas in the resulting pack. See if
@@ -421,7 +430,7 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 					  line.buf);
 		write_promisor_file(promisor_name, NULL, 0);
 
-		item->util = populate_pack_exts(item->string);
+		item->util = populate_pack_exts(item->string, 0);
 
 		free(promisor_name);
 	}
@@ -731,8 +740,13 @@ static void midx_included_packs(struct string_list *include,
 
 	for_each_string_list_item(item, &existing->kept_packs)
 		string_list_insert(include, xstrfmt("%s.idx", item->string));
-	for_each_string_list_item(item, names)
-		string_list_insert(include, xstrfmt("pack-%s.idx", item->string));
+	for_each_string_list_item(item, names) {
+		const char *marker = "";
+		struct generated_pack_data *data = item->util;
+		if (data->disjoint)
+			marker = "+";
+		string_list_insert(include, xstrfmt("%spack-%s.idx", marker, item->string));
+	}
 	if (geometry->split_factor) {
 		struct strbuf buf = STRBUF_INIT;
 		uint32_t i;
@@ -788,7 +802,8 @@ static int write_midx_included_packs(struct string_list *include,
 				     struct pack_geometry *geometry,
 				     struct string_list *names,
 				     const char *refs_snapshot,
-				     int show_progress, int write_bitmaps)
+				     int show_progress, int write_bitmaps,
+				     int exclude_disjoint)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
 	struct string_list_item *item;
@@ -852,6 +867,9 @@ static int write_midx_included_packs(struct string_list *include,
 	if (refs_snapshot)
 		strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
 
+	if (exclude_disjoint)
+		strvec_push(&cmd.args, "--retain-disjoint");
+
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
@@ -895,7 +913,7 @@ static void remove_redundant_bitmaps(struct string_list *include,
 
 static int finish_pack_objects_cmd(struct child_process *cmd,
 				   struct string_list *names,
-				   int local)
+				   int local, int disjoint)
 {
 	FILE *out;
 	struct strbuf line = STRBUF_INIT;
@@ -913,7 +931,7 @@ static int finish_pack_objects_cmd(struct child_process *cmd,
 		 */
 		if (local) {
 			item = string_list_append(names, line.buf);
-			item->util = populate_pack_exts(line.buf);
+			item->util = populate_pack_exts(line.buf, disjoint);
 		}
 	}
 	fclose(out);
@@ -970,7 +988,7 @@ static int write_filtered_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s%s.pack\n", caret, item->string);
 	fclose(in);
 
-	return finish_pack_objects_cmd(&cmd, names, local);
+	return finish_pack_objects_cmd(&cmd, names, local, 0);
 }
 
 static int existing_cruft_pack_cmp(const void *va, const void *vb)
@@ -1098,7 +1116,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
 
-	return finish_pack_objects_cmd(&cmd, names, local);
+	return finish_pack_objects_cmd(&cmd, names, local, 0);
 }
 
 static const char *find_pack_prefix(const char *packdir, const char *packtmp)
@@ -1190,6 +1208,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			   N_("pack prefix to store a pack containing pruned objects")),
 		OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
 			   N_("pack prefix to store a pack containing filtered out objects")),
+		OPT_BOOL(0, "extend-disjoint", &po_args.ignore_disjoint,
+			 N_("add new packs to the set of disjoint ones")),
 		OPT_END()
 	};
 
@@ -1255,6 +1275,16 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strbuf_release(&path);
 	}
 
+	if (po_args.ignore_disjoint) {
+		if (filter_to)
+			die(_("options '%s' and '%s' cannot be used together"),
+			    "--filter-to", "--extend-disjoint");
+		if (pack_everything && !delete_redundant)
+			die(_("cannot use '--extend-disjoint' with '%s' but not '-d'"),
+			    pack_everything & LOOSEN_UNREACHABLE ? "-A" :
+			    pack_everything & PACK_CRUFT ? "--cruft" : "-a");
+	}
+
 	packdir = mkpathdup("%s/pack", get_object_directory());
 	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
 	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
@@ -1308,6 +1338,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
+		if (delete_redundant)
+			strvec_pushf(&cmd.args, "--no-ignore-disjoint");
+
 		if (has_existing_non_kept_packs(&existing) &&
 		    delete_redundant &&
 		    !(pack_everything & PACK_CRUFT)) {
@@ -1364,7 +1397,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		fclose(in);
 	}
 
-	ret = finish_pack_objects_cmd(&cmd, &names, 1);
+	ret = finish_pack_objects_cmd(&cmd, &names, 1, po_args.ignore_disjoint);
 	if (ret)
 		goto cleanup;
 
@@ -1387,6 +1420,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 
 		cruft_po_args.local = po_args.local;
 		cruft_po_args.quiet = po_args.quiet;
+		cruft_po_args.ignore_disjoint = 0;
 
 		ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
 				       cruft_expiration, &names,
@@ -1487,7 +1521,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 
 		ret = write_midx_included_packs(&include, &geometry, &names,
 						refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
-						show_progress, write_bitmaps > 0);
+						show_progress, write_bitmaps > 0,
+						po_args.ignore_disjoint);
 
 		if (!ret && write_bitmaps)
 			remove_redundant_bitmaps(&include, packdir);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index d2975e6c93..277f1ff1d7 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -6,6 +6,7 @@ test_description='git repack works correctly'
 . "${TEST_DIRECTORY}/lib-bitmap.sh"
 . "${TEST_DIRECTORY}/lib-midx.sh"
 . "${TEST_DIRECTORY}/lib-terminal.sh"
+. "${TEST_DIRECTORY}/lib-disjoint.sh"
 
 commit_and_pack () {
 	test_commit "$@" 1>&2 &&
@@ -525,7 +526,8 @@ test_expect_success '--filter works with --max-pack-size' '
 '
 
 objdir=.git/objects
-midx=$objdir/pack/multi-pack-index
+packdir=$objdir/pack
+midx=$packdir/multi-pack-index
 
 test_expect_success 'setup for --write-midx tests' '
 	git init midx &&
diff --git a/t/t7705-repack-extend-disjoint.sh b/t/t7705-repack-extend-disjoint.sh
new file mode 100755
index 0000000000..0c8be1cb3f
--- /dev/null
+++ b/t/t7705-repack-extend-disjoint.sh
@@ -0,0 +1,142 @@
+#!/bin/sh
+
+test_description='git repack --extend-disjoint works correctly'
+
+. ./test-lib.sh
+. "${TEST_DIRECTORY}/lib-disjoint.sh"
+
+packdir=.git/objects/pack
+
+GIT_TEST_MULTI=0
+GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
+
+test_expect_success 'repack --extend-disjoint creates new disjoint packs' '
+	git init repo &&
+	(
+		cd repo &&
+
+		test_commit A &&
+		test_commit B &&
+
+		A="$(echo A | git pack-objects --revs $packdir/pack)" &&
+		B="$(echo A..B | git pack-objects --revs $packdir/pack)" &&
+
+		git prune-packed &&
+
+		cat >in <<-EOF &&
+		pack-$A.idx
+		+pack-$B.idx
+		EOF
+		git multi-pack-index write --bitmap --stdin-packs <in &&
+
+		test_must_not_be_disjoint "pack-$A.pack" &&
+		test_must_be_disjoint "pack-$B.pack" &&
+
+		test_commit C &&
+
+		find $packdir -type f -name "*.idx" | sort >packs.before &&
+		git repack --write-midx --write-bitmap-index --extend-disjoint &&
+		find $packdir -type f -name "*.idx" | sort >packs.after &&
+
+		comm -13 packs.before packs.after >packs.new &&
+
+		test_line_count = 1 packs.new &&
+
+		test_must_not_be_disjoint "pack-$A.pack" &&
+		test_must_be_disjoint "pack-$B.pack" &&
+		test_must_be_disjoint "$(basename $(cat packs.new) .idx).pack"
+	)
+'
+
+test_expect_success 'repack --extend-disjoint combines existing disjoint packs' '
+	(
+		cd repo &&
+
+		test_commit D &&
+
+		git repack -a -d --write-midx --write-bitmap-index --extend-disjoint &&
+
+		find $packdir -type f -name "*.pack" >packs &&
+		test_line_count = 1 packs &&
+
+		test_must_be_disjoint "$(basename $(cat packs))"
+
+	)
+'
+
+test_expect_success 'repack --extend-disjoint with --geometric' '
+	git init disjoint-geometric &&
+	(
+		cd disjoint-geometric &&
+
+		test_commit_bulk 8 &&
+		base="$(basename $(ls $packdir/pack-*.idx))" &&
+		echo "+$base" >>in &&
+
+		test_commit A &&
+		A="$(echo HEAD^.. | git pack-objects --revs $packdir/pack)" &&
+		test_commit B &&
+		B="$(echo HEAD^.. | git pack-objects --revs $packdir/pack)" &&
+
+		git prune-packed &&
+
+		cat >>in <<-EOF &&
+		+pack-$A.idx
+		+pack-$B.idx
+		EOF
+		git multi-pack-index write --bitmap --stdin-packs <in &&
+
+		test_must_be_disjoint "pack-$A.pack" &&
+		test_must_be_disjoint "pack-$B.pack" &&
+		test_must_be_disjoint "${base%.idx}.pack" &&
+
+		test_commit C &&
+
+		find $packdir -type f -name "*.pack" | sort >packs.before &&
+		git repack --geometric=2 -d --write-midx --write-bitmap-index --extend-disjoint &&
+		find $packdir -type f -name "*.pack" | sort >packs.after &&
+
+		comm -12 packs.before packs.after >packs.unchanged &&
+		comm -23 packs.before packs.after >packs.removed &&
+		comm -13 packs.before packs.after >packs.new &&
+
+		cat >expect <<-EOF &&
+		$packdir/${base%.idx}.pack
+		EOF
+		test_cmp expect packs.unchanged &&
+
+		sort >expect <<-EOF &&
+		$packdir/pack-$A.pack
+		$packdir/pack-$B.pack
+		EOF
+		test_cmp expect packs.removed &&
+
+		test_line_count = 1 packs.new &&
+
+		test_must_be_disjoint "$(basename $(cat packs.new))" &&
+		test_must_be_disjoint "${base%.idx}.pack"
+	)
+'
+
+for flag in "-A" "-a" "--cruft"
+do
+	test_expect_success "repack --extend-disjoint incompatible with $flag without -d" '
+		test_must_fail git repack $flag --extend-disjoint \
+			--write-midx --write-bitmap-index 2>actual &&
+		cat >expect <<-EOF &&
+		fatal: cannot use $SQ--extend-disjoint$SQ with $SQ$flag$SQ but not $SQ-d$SQ
+		EOF
+		test_cmp expect actual
+	'
+done
+
+test_expect_success 'repack --extend-disjoint is incompatible with --filter-to' '
+	test_must_fail git repack --extend-disjoint --filter-to=dir 2>actual &&
+
+	cat >expect <<-EOF &&
+	fatal: options $SQ--filter-to$SQ and $SQ--extend-disjoint$SQ cannot be used together
+	EOF
+	test_cmp expect actual
+'
+
+test_done
-- 
2.43.0.24.g980b318f98


* [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (8 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 09/24] repack: implement `--extend-disjoint` mode Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-12-07 13:13   ` Patrick Steinhardt
  2023-11-28 19:08 ` [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
                   ` (16 subsequent siblings)
  26 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

When trying to assemble a pack with bitmaps using `--use-bitmap-index`,
`pack-objects` asks the pack-bitmap machinery for a bitmap which
indicates the set of objects we can "reuse" verbatim from disk.

This set is roughly comprised of: a prefix of objects in the bitmapped
pack (or preferred pack, in the case of a multi-pack reachability
bitmap), plus any other objects not included in the prefix, excluding
any deltas whose base we are not sending in the resulting pack.

The pack-bitmap machinery is responsible for computing this bitmap, and
does so with the following functions:

  - reuse_partial_packfile_from_bitmap()
  - try_partial_reuse()

In the existing implementation, the first function is responsible for
(a) marking the prefix of objects in the reusable pack, and then (b)
calling try_partial_reuse() on any remaining objects to ensure that they
are also reusable (and removing them from the bitmapped set if they are
not).

Likewise, the `try_partial_reuse()` function is responsible for checking
whether an isolated object (that is, an object from the bitmapped
pack/preferred pack not contained in the prefix from earlier) may be
reused, i.e. that it isn't a delta of an object that we are not sending
in the resulting pack.

These functions are based on two core assumptions, which we will unwind
in this and the following commits:

  1. There is only a single pack from the bitmap which is eligible for
     verbatim pack-reuse. For single-pack bitmaps, this is trivially the
     bitmapped pack. For multi-pack bitmaps, this is (currently) the
     MIDX's preferred pack.

  2. The pack eligible for reuse has its first object in bit position 0,
     and all objects from that pack follow in pack-order from that first
     bit position.

In order to perform verbatim pack reuse over multiple packs, we must
unwind these two assumptions. Most notably, in order to reuse bits from
a given packfile, we need to know the first bit position occupied by
an object from that packfile. To propagate this information around, pass
a `struct bitmapped_pack *` anywhere we previously passed a `struct
packed_git *`, since the former contains the bitmap position we're
interested in (as well as a pointer to the latter).
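
To make the role of the bitmap position concrete, here is a simplified,
self-contained sketch of the offset arithmetic. It is an illustration
rather than the real implementation: the `toy_bitmapped_pack` struct,
the plain `uint64_t` word array standing in for `struct bitmap`, and the
`is_delta`/`base_pos` parameters (which replace reading the object
header from the pack) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced version of the information carried per pack. */
struct toy_bitmapped_pack {
	uint32_t bitmap_pos;  /* first bit occupied by this pack's objects */
	uint32_t num_objects; /* number of objects in the pack */
};

static void toy_bitmap_set(uint64_t *words, size_t bit)
{
	words[bit / 64] |= (uint64_t)1 << (bit % 64);
}

static int toy_bitmap_get(const uint64_t *words, size_t bit)
{
	return (int)((words[bit / 64] >> (bit % 64)) & 1);
}

/*
 * `pos` and `base_pos` are pack-relative positions; the reuse bitmap is
 * indexed by pack->bitmap_pos + position. Returns -1 when `pos` is not
 * in the pack ("stop trying further objects"), and 0 otherwise, whether
 * or not the object was marked for reuse.
 */
static int toy_try_partial_reuse(const struct toy_bitmapped_pack *pack,
				 size_t pos, int is_delta, size_t base_pos,
				 uint64_t *reuse)
{
	assert(pack && reuse);

	if (pos >= pack->num_objects)
		return -1;

	/* A delta is only reusable if its base is being reused as well. */
	if (is_delta && !toy_bitmap_get(reuse, pack->bitmap_pos + base_pos))
		return 0;

	toy_bitmap_set(reuse, pack->bitmap_pos + pos);
	return 0;
}
```

For a pack whose `bitmap_pos` is 70, reusing its first object sets bit
70 of the reuse bitmap rather than bit 0, which is exactly the
information a bare `struct packed_git *` cannot provide.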

As an additional step, factor out a sub-routine from the main
`reuse_partial_packfile_from_bitmap()` function, called
`reuse_partial_packfile_from_bitmap_1()`. This new function will be
responsible for figuring out which objects may be reused from a single
pack, and the existing function will dispatch multiple calls to its new
helper function for each reusable pack.

Consequently, `reuse_partial_packfile_from_bitmap()` will now maintain
an array of reusable packs instead of a single such pack. We currently
expect that array to have only a single element, so this awkward state
is short-lived. It will serve as useful scaffolding in subsequent
commits as we begin to work towards enabling multi-pack reuse.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 105 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 74 insertions(+), 31 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index d2f1306960..2ebe2c314e 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1836,7 +1836,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
  * -1 means "stop trying further objects"; 0 means we may or may not have
  * reused, but you can keep feeding bits.
  */
-static int try_partial_reuse(struct packed_git *pack,
+static int try_partial_reuse(struct bitmapped_pack *pack,
 			     size_t pos,
 			     struct bitmap *reuse,
 			     struct pack_window **w_curs)
@@ -1868,11 +1868,11 @@ static int try_partial_reuse(struct packed_git *pack,
 	 * preferred pack precede all bits from other packs.
 	 */
 
-	if (pos >= pack->num_objects)
+	if (pos >= pack->p->num_objects)
 		return -1; /* not actually in the pack or MIDX preferred pack */
 
-	offset = delta_obj_offset = pack_pos_to_offset(pack, pos);
-	type = unpack_object_header(pack, w_curs, &offset, &size);
+	offset = delta_obj_offset = pack_pos_to_offset(pack->p, pos);
+	type = unpack_object_header(pack->p, w_curs, &offset, &size);
 	if (type < 0)
 		return -1; /* broken packfile, punt */
 
@@ -1888,11 +1888,11 @@ static int try_partial_reuse(struct packed_git *pack,
 		 * and the normal slow path will complain about it in
 		 * more detail.
 		 */
-		base_offset = get_delta_base(pack, w_curs, &offset, type,
+		base_offset = get_delta_base(pack->p, w_curs, &offset, type,
 					     delta_obj_offset);
 		if (!base_offset)
 			return 0;
-		if (offset_to_pack_pos(pack, base_offset, &base_pos) < 0)
+		if (offset_to_pack_pos(pack->p, base_offset, &base_pos) < 0)
 			return 0;
 
 		/*
@@ -1915,14 +1915,14 @@ static int try_partial_reuse(struct packed_git *pack,
 		 * to REF_DELTA on the fly. Better to just let the normal
 		 * object_entry code path handle it.
 		 */
-		if (!bitmap_get(reuse, base_pos))
+		if (!bitmap_get(reuse, pack->bitmap_pos + base_pos))
 			return 0;
 	}
 
 	/*
 	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
-	bitmap_set(reuse, pos);
+	bitmap_set(reuse, pack->bitmap_pos + pos);
 	return 0;
 }
 
@@ -1934,29 +1934,13 @@ uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
 	return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
 }
 
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-				       struct packed_git **packfile_out,
-				       uint32_t *entries,
-				       struct bitmap **reuse_out)
+static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git,
+						 struct bitmapped_pack *pack,
+						 struct bitmap *reuse)
 {
-	struct repository *r = the_repository;
-	struct packed_git *pack;
 	struct bitmap *result = bitmap_git->result;
-	struct bitmap *reuse;
 	struct pack_window *w_curs = NULL;
 	size_t i = 0;
-	uint32_t offset;
-	uint32_t objects_nr;
-
-	assert(result);
-
-	load_reverse_index(r, bitmap_git);
-
-	if (bitmap_is_midx(bitmap_git))
-		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
-	else
-		pack = bitmap_git->pack;
-	objects_nr = pack->num_objects;
 
 	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
 		i++;
@@ -1969,15 +1953,15 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * we use it instead of another pack. In single-pack bitmaps, the choice
 	 * is made for us.
 	 */
-	if (i > objects_nr / BITS_IN_EWORD)
-		i = objects_nr / BITS_IN_EWORD;
+	if (i > pack->p->num_objects / BITS_IN_EWORD)
+		i = pack->p->num_objects / BITS_IN_EWORD;
 
-	reuse = bitmap_word_alloc(i);
 	memset(reuse->words, 0xFF, i * sizeof(eword_t));
 
 	for (; i < result->word_alloc; ++i) {
 		eword_t word = result->words[i];
 		size_t pos = (i * BITS_IN_EWORD);
+		size_t offset;
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
 			if ((word >> offset) == 0)
@@ -2002,6 +1986,65 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 
 done:
 	unuse_pack(&w_curs);
+}
+
+int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+				       struct packed_git **packfile_out,
+				       uint32_t *entries,
+				       struct bitmap **reuse_out)
+{
+	struct repository *r = the_repository;
+	struct bitmapped_pack *packs = NULL;
+	struct bitmap *result = bitmap_git->result;
+	struct bitmap *reuse;
+	size_t i;
+	size_t packs_nr = 0, packs_alloc = 0;
+	size_t word_alloc;
+	uint32_t objects_nr = 0;
+
+	assert(result);
+
+	load_reverse_index(r, bitmap_git);
+
+	if (bitmap_is_midx(bitmap_git)) {
+		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
+			struct bitmapped_pack pack;
+			if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) {
+				warning(_("unable to load pack: '%s', disabling pack-reuse"),
+					bitmap_git->midx->pack_names[i]);
+				free(packs);
+				return -1;
+			}
+			if (!pack.bitmap_nr)
+				continue; /* no objects from this pack */
+			if (pack.bitmap_pos)
+				continue; /* not preferred pack */
+
+			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+			memcpy(&packs[packs_nr++], &pack, sizeof(pack));
+
+			objects_nr += pack.p->num_objects;
+		}
+	} else {
+		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+
+		packs[packs_nr].p = bitmap_git->pack;
+		packs[packs_nr].bitmap_pos = 0;
+		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
+		packs[packs_nr].disjoint = 1;
+
+		objects_nr = packs[packs_nr++].p->num_objects;
+	}
+
+	word_alloc = objects_nr / BITS_IN_EWORD;
+	if (objects_nr % BITS_IN_EWORD)
+		word_alloc++;
+	reuse = bitmap_word_alloc(word_alloc);
+
+	if (packs_nr != 1)
+		BUG("pack reuse not yet implemented for multiple packs");
+
+	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
 
 	*entries = bitmap_popcount(reuse);
 	if (!*entries) {
@@ -2014,7 +2057,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * need to be handled separately.
 	 */
 	bitmap_and_not(result, reuse);
-	*packfile_out = pack;
+	*packfile_out = packs[0].p;
 	*reuse_out = reuse;
 	return 0;
 }
-- 
2.43.0.24.g980b318f98



* [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (9 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-12-07 13:13   ` Patrick Steinhardt
  2023-11-28 19:08 ` [PATCH 12/24] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The signature of `reuse_partial_packfile_from_bitmap()` currently takes
in a bitmap, as well as three output parameters (filled through
pointers, and passed as arguments), and also returns an integer result.

The output parameters are filled out with: (a) the packfile used for
pack-reuse, (b) the number of objects from that pack that we can reuse,
and (c) a bitmap indicating which objects we can reuse. The return value
is either -1 (when there are no objects to reuse), or 0 (when there is
at least one object to reuse).

Some of these parameters are redundant. Notably, we can infer from the
bitmap how many objects are reused by calling bitmap_popcount(). And we
can similarly compute the return value based on that number as well.

As such, clean up the signature of this function to drop the "*entries"
parameter, as well as the int return value, since the single caller of
this function can infer these values itself.
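
As a rough illustration of the reasoning above, the sketch below shows a
caller deriving the object count from the reuse bitmap instead of
receiving it through an output parameter. The `toy_bitmap` type and the
`toy_*` helpers are hypothetical stand-ins invented for this example;
they are not Git's `struct bitmap` or `bitmap_popcount()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a bitmap; not Git's real definition. */
struct toy_bitmap {
	uint64_t *words;
	size_t word_alloc;
};

/* Count the set bits across all words of the bitmap. */
static size_t toy_bitmap_popcount(const struct toy_bitmap *b)
{
	size_t i, count = 0;

	assert(b);
	for (i = 0; i < b->word_alloc; i++) {
		uint64_t w = b->words[i];
		while (w) {
			w &= w - 1; /* clear the lowest set bit */
			count++;
		}
	}
	return count;
}

/*
 * Caller-side view: a NULL bitmap means "nothing to reuse"; otherwise
 * the number of reused objects is derived from the bitmap itself.
 */
static size_t toy_reused_objects(const struct toy_bitmap *reuse)
{
	return reuse ? toy_bitmap_popcount(reuse) : 0;
}
```

With this shape, "is there anything to reuse?" and "how many objects?"
are both answered by the bitmap, so neither needs its own return
channel.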

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 pack-bitmap.c          | 16 +++++++---------
 pack-bitmap.h          |  7 +++----
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 107154db34..2bb1b64e8f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3946,13 +3946,15 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
 		return -1;
 
-	if (pack_options_allow_reuse() &&
-	    !reuse_partial_packfile_from_bitmap(
-			bitmap_git,
-			&reuse_packfile,
-			&reuse_packfile_objects,
-			&reuse_packfile_bitmap)) {
-		assert(reuse_packfile_objects);
+	if (pack_options_allow_reuse())
+		reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
+						   &reuse_packfile_bitmap);
+
+	if (reuse_packfile) {
+		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
+		if (!reuse_packfile_objects)
+			BUG("expected non-empty reuse bitmap");
+
 		nr_result += reuse_packfile_objects;
 		nr_seen += reuse_packfile_objects;
 		display_progress(progress_state, nr_seen);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 2ebe2c314e..614fc09a4e 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1988,10 +1988,9 @@ static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
 	unuse_pack(&w_curs);
 }
 
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-				       struct packed_git **packfile_out,
-				       uint32_t *entries,
-				       struct bitmap **reuse_out)
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+					struct packed_git **packfile_out,
+					struct bitmap **reuse_out)
 {
 	struct repository *r = the_repository;
 	struct bitmapped_pack *packs = NULL;
@@ -2013,7 +2012,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				warning(_("unable to load pack: '%s', disabling pack-reuse"),
 					bitmap_git->midx->pack_names[i]);
 				free(packs);
-				return -1;
+				return;
 			}
 			if (!pack.bitmap_nr)
 				continue; /* no objects from this pack */
@@ -2046,10 +2045,10 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 
 	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
 
-	*entries = bitmap_popcount(reuse);
-	if (!*entries) {
+	if (!bitmap_popcount(reuse)) {
+		free(packs);
 		bitmap_free(reuse);
-		return -1;
+		return;
 	}
 
 	/*
@@ -2059,7 +2058,6 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	bitmap_and_not(result, reuse);
 	*packfile_out = packs[0].p;
 	*reuse_out = reuse;
-	return 0;
 }
 
 int bitmap_walk_contains(struct bitmap_index *bitmap_git,
diff --git a/pack-bitmap.h b/pack-bitmap.h
index b7fa1a42a9..5bc1ca5b65 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -78,10 +78,9 @@ int test_bitmap_hashes(struct repository *r);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects);
 uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
-				       struct packed_git **packfile,
-				       uint32_t *entries,
-				       struct bitmap **reuse_out);
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *,
+					struct packed_git **packfile,
+					struct bitmap **reuse_out);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
-- 
2.43.0.24.g980b318f98



* [PATCH 12/24] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (10 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 13/24] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Further prepare for enabling verbatim pack-reuse over multiple packfiles
by changing the signature of reuse_partial_packfile_from_bitmap() to
populate an array of `struct bitmapped_pack *`'s instead of a pointer to
a single packfile.

Since the array we're filling out is sized dynamically[^1], add an
additional `size_t *` parameter which will hold the number of reusable
packs (equal to the number of elements in the array).

Note that since we still have not implemented true multi-pack reuse,
these changes aren't propagated out to the rest of the caller in
builtin/pack-objects.c.

In the interim state, we expect that the array has a single element, and
we use that element to fill out the static `reuse_packfile` variable
(which is a bog-standard `struct packed_git *`). Future commits will
continue to push this change further out through the pack-objects code.

[^1]: That is, even though we know the number of packs which are
  candidates for pack-reuse, we do not know how many of those
  candidates we can actually reuse.
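
The calling convention described above (an array plus a separate
element count, both filled through pointers) might look roughly like
the following. This is only a sketch: `toy_pack` and
`toy_select_packs()` are made-up names, and the selection rule (skip
packs with no bitmapped objects) is an assumption chosen to mirror the
`bitmap_nr` check in the earlier patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for `struct bitmapped_pack`. */
struct toy_pack {
	size_t bitmap_pos;
	size_t bitmap_nr;
};

/*
 * Copy every candidate with at least one bitmapped object into a newly
 * allocated array. On return, *packs_out is NULL and *packs_nr_out is 0
 * if nothing qualified (or allocation failed); otherwise the caller
 * owns the array and must free() it.
 */
static void toy_select_packs(const struct toy_pack *candidates,
			     size_t candidates_nr,
			     struct toy_pack **packs_out,
			     size_t *packs_nr_out)
{
	struct toy_pack *packs;
	size_t i, nr = 0;

	assert(packs_out && packs_nr_out);
	*packs_out = NULL;
	*packs_nr_out = 0;

	if (!candidates_nr)
		return;
	packs = malloc(candidates_nr * sizeof(*packs));
	if (!packs)
		return;

	for (i = 0; i < candidates_nr; i++) {
		if (!candidates[i].bitmap_nr)
			continue; /* no objects from this pack */
		packs[nr++] = candidates[i];
	}

	if (!nr) {
		free(packs);
		return;
	}
	*packs_out = packs;
	*packs_nr_out = nr;
}
```

The count cannot be derived from the number of candidates, which is why
it travels alongside the array.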

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 9 +++++++--
 pack-bitmap.c          | 6 ++++--
 pack-bitmap.h          | 5 +++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2bb1b64e8f..89de23f39a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3943,14 +3943,19 @@ static int pack_options_allow_reuse(void)
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
 {
+	struct bitmapped_pack *packs = NULL;
+	size_t packs_nr = 0;
+
 	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
 		return -1;
 
 	if (pack_options_allow_reuse())
-		reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
+		reuse_partial_packfile_from_bitmap(bitmap_git, &packs,
+						   &packs_nr,
 						   &reuse_packfile_bitmap);
 
-	if (reuse_packfile) {
+	if (packs) {
+		reuse_packfile = packs[0].p;
 		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
 		if (!reuse_packfile_objects)
 			BUG("expected non-empty reuse bitmap");
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 614fc09a4e..670deec909 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1989,7 +1989,8 @@ static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
 }
 
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-					struct packed_git **packfile_out,
+					struct bitmapped_pack **packs_out,
+					size_t *packs_nr_out,
 					struct bitmap **reuse_out)
 {
 	struct repository *r = the_repository;
@@ -2056,7 +2057,8 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * need to be handled separately.
 	 */
 	bitmap_and_not(result, reuse);
-	*packfile_out = packs[0].p;
+	*packs_out = packs;
+	*packs_nr_out = packs_nr;
 	*reuse_out = reuse;
 }
 
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 5bc1ca5b65..901a3b86ed 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -78,8 +78,9 @@ int test_bitmap_hashes(struct repository *r);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects);
 uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
-void reuse_partial_packfile_from_bitmap(struct bitmap_index *,
-					struct packed_git **packfile,
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+					struct bitmapped_pack **packs_out,
+					size_t *packs_nr_out,
 					struct bitmap **reuse_out);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
-- 
2.43.0.24.g980b318f98



* [PATCH 13/24] pack-objects: parameterize pack-reuse routines over a single pack
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (11 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 12/24] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The routines pack-objects uses to perform verbatim pack-reuse are:

  - write_reused_pack_one()
  - write_reused_pack_verbatim()
  - write_reused_pack()

All of these assume that there is exactly one packfile being reused:
the global variable `reuse_packfile`.

Prepare for reusing objects from multiple packs by making the reuse
packfile a parameter of each of the above functions, so that they can
later be called in a loop over multiple packfiles.

Note that we still have the global "reuse_packfile", but pass it through
each of the above functions' parameter lists, eliminating all but one
direct access (the top-level caller in `write_pack_file()`). Even after
this series, we will still have a global, but it will hold the array of
reusable packfiles, and we'll pass them one at a time to these functions
in a loop.

Note also that we will eventually need to pass a `bitmapped_pack`
instead of a `packed_git` in order to hold onto additional information
required for reuse (such as the bit position of the first object
belonging to that pack). But that change will be made in a future commit
so as to minimize the noise below as much as possible.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 89de23f39a..7682bd65bb 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1014,7 +1014,8 @@ static off_t find_reused_offset(off_t where)
 	return reused_chunks[lo-1].difference;
 }
 
-static void write_reused_pack_one(size_t pos, struct hashfile *out,
+static void write_reused_pack_one(struct packed_git *reuse_packfile,
+				  size_t pos, struct hashfile *out,
 				  struct pack_window **w_curs)
 {
 	off_t offset, next, cur;
@@ -1092,7 +1093,8 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
 	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
 }
 
-static size_t write_reused_pack_verbatim(struct hashfile *out,
+static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
+					 struct hashfile *out,
 					 struct pack_window **w_curs)
 {
 	size_t pos = 0;
@@ -1119,14 +1121,15 @@ static size_t write_reused_pack_verbatim(struct hashfile *out,
 	return pos;
 }
 
-static void write_reused_pack(struct hashfile *f)
+static void write_reused_pack(struct packed_git *reuse_packfile,
+			      struct hashfile *f)
 {
 	size_t i = 0;
 	uint32_t offset;
 	struct pack_window *w_curs = NULL;
 
 	if (allow_ofs_delta)
-		i = write_reused_pack_verbatim(f, &w_curs);
+		i = write_reused_pack_verbatim(reuse_packfile, f, &w_curs);
 
 	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
 		eword_t word = reuse_packfile_bitmap->words[i];
@@ -1142,7 +1145,8 @@ static void write_reused_pack(struct hashfile *f)
 			 * bitmaps. See comment in try_partial_reuse()
 			 * for why.
 			 */
-			write_reused_pack_one(pos + offset, f, &w_curs);
+			write_reused_pack_one(reuse_packfile, pos + offset, f,
+					      &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
@@ -1200,7 +1204,7 @@ static void write_pack_file(void)
 
 		if (reuse_packfile) {
 			assert(pack_to_stdout);
-			write_reused_pack(f);
+			write_reused_pack(reuse_packfile, f);
 			offset = hashfile_total(f);
 		}
 
-- 
2.43.0.24.g980b318f98



* [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (12 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 13/24] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-12-07 13:13   ` Patrick Steinhardt
  2023-11-28 19:08 ` [PATCH 15/24] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
                   ` (12 subsequent siblings)
  26 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

When reusing objects from a pack, we keep track of a set of one or more
`reused_chunk`s, corresponding to sections of one or more object(s) from
a source pack that we are reusing. Each chunk contains two pieces of
information:

  - the offset of the first object in the source pack (relative to the
    beginning of the source pack)
  - the difference between that offset, and the corresponding offset in
    the pack we're generating

The purpose of keeping track of these is so that we can patch any
OFS_DELTAs that cross over a section of the reuse pack that we didn't
take.

For instance, consider a hypothetical pack as shown below:

                                                (chunk #2)
                                                __________...
                                               /
                                              /
      +--------+---------+-------------------+---------+
  ... | <base> | <other> |      (unused)     | <delta> | ...
      +--------+---------+-------------------+---------+
       \                /
        \______________/
           (chunk #1)

Suppose that we are sending objects "base", "other", and "delta", and
that the "delta" object is stored as an OFS_DELTA, and that its base is
"base". If we don't send any objects in the "(unused)" range, we can't
copy the delta'd object directly, since its delta offset includes a
range of the pack that we didn't copy, so we have to account for that
difference when patching and reassembling the delta.

In order to compute this value correctly, we need to know not only where
we are in the packfile we're assembling (with `hashfile_total(f)`) but
also the position of the first byte of the packfile that we are
currently reusing.

Together, these two allow us to compute the reused chunk's offset
difference relative to the start of the reused pack, as desired.
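
The arithmetic described above can be sketched with a small worked
example. Everything named `toy_*` below is hypothetical and only
loosely modeled on `record_reused_object()` and `find_reused_offset()`;
in particular, the lookup here is a linear scan over chunks recorded in
increasing offset order, which is an assumption made for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CHUNKS 16

struct toy_chunk {
	int64_t original;   /* chunk start offset in the source pack */
	int64_t difference; /* source offset minus pack-relative output offset */
};

static struct toy_chunk toy_chunks[TOY_MAX_CHUNKS];
static size_t toy_chunks_nr;

/*
 * Record a chunk that starts at `src_offset` in the source pack and is
 * about to be written when the output holds `out_total` bytes.
 * `pack_start` is the output position at which this reused pack began
 * (less the pack header), so the difference is relative to that pack.
 */
static void toy_record_chunk(int64_t src_offset, int64_t out_total,
			     int64_t pack_start)
{
	assert(toy_chunks_nr < TOY_MAX_CHUNKS);
	toy_chunks[toy_chunks_nr].original = src_offset;
	toy_chunks[toy_chunks_nr].difference =
		src_offset - (out_total - pack_start);
	toy_chunks_nr++;
}

/* Difference of the last chunk starting at or before `where`. */
static int64_t toy_find_difference(int64_t where)
{
	size_t i;
	int64_t diff = 0;

	for (i = 0; i < toy_chunks_nr && toy_chunks[i].original <= where; i++)
		diff = toy_chunks[i].difference;
	return diff;
}

/* New delta-to-base distance after dropping the unused bytes between them. */
static int64_t toy_patch_ofs_delta(int64_t delta_offset, int64_t base_offset)
{
	int64_t fixup = toy_find_difference(delta_offset) -
			toy_find_difference(base_offset);

	return (delta_offset - base_offset) - fixup;
}
```

For instance, suppose "base" sits at source offset 12, "delta" at source
offset 512, and 300 bytes in between are not copied. If this pack's
data starts at output byte 1012 (so `pack_start` is 1000), the two
chunks get differences 0 and 300, and the stored distance of 500 is
patched down to 200, which matches the distance between the two objects
in the output.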

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7682bd65bb..eb8be514d1 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1016,6 +1016,7 @@ static off_t find_reused_offset(off_t where)
 
 static void write_reused_pack_one(struct packed_git *reuse_packfile,
 				  size_t pos, struct hashfile *out,
+				  off_t pack_start,
 				  struct pack_window **w_curs)
 {
 	off_t offset, next, cur;
@@ -1025,7 +1026,8 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 	offset = pack_pos_to_offset(reuse_packfile, pos);
 	next = pack_pos_to_offset(reuse_packfile, pos + 1);
 
-	record_reused_object(offset, offset - hashfile_total(out));
+	record_reused_object(offset,
+			     offset - (hashfile_total(out) - pack_start));
 
 	cur = offset;
 	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
@@ -1095,6 +1097,7 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 
 static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
 					 struct hashfile *out,
+					 off_t pack_start UNUSED,
 					 struct pack_window **w_curs)
 {
 	size_t pos = 0;
@@ -1126,10 +1129,12 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
 {
 	size_t i = 0;
 	uint32_t offset;
+	off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);
 	struct pack_window *w_curs = NULL;
 
 	if (allow_ofs_delta)
-		i = write_reused_pack_verbatim(reuse_packfile, f, &w_curs);
+		i = write_reused_pack_verbatim(reuse_packfile, f, pack_start,
+					       &w_curs);
 
 	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
 		eword_t word = reuse_packfile_bitmap->words[i];
@@ -1146,7 +1151,7 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
 			 * for why.
 			 */
 			write_reused_pack_one(reuse_packfile, pos + offset, f,
-					      &w_curs);
+					      pack_start, &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
-- 
2.43.0.24.g980b318f98



* [PATCH 15/24] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (13 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Further prepare pack-objects to perform verbatim pack-reuse over
multiple packfiles by converting functions that take in a pointer to a
`struct packed_git` to instead take in a pointer to a `struct
bitmapped_pack`.

The additional information found in the bitmapped_pack struct (such as
the bit position corresponding to the beginning of the pack) will be
necessary in order to perform verbatim pack-reuse.

Note that we don't use any of the extra pieces of information contained
in the bitmapped_pack struct, so this step is merely preparatory and
does not introduce any functional changes.

Note further that we do not change the argument type to
write_reused_pack_one(). That function is responsible for copying
sections of the packfile directly and optionally patching any OFS_DELTAs
to account for not reusing sections of the packfile in between a delta
and its base.

As such, that function is (and should remain) oblivious to multi-pack
reuse, and does not require any of the extra pieces of information
stored in the bitmapped_pack struct.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index eb8be514d1..3b7704d062 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -222,7 +222,8 @@ static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
 
-static struct packed_git *reuse_packfile;
+static struct bitmapped_pack *reuse_packfiles;
+static size_t reuse_packfiles_nr;
 static uint32_t reuse_packfile_objects;
 static struct bitmap *reuse_packfile_bitmap;
 
@@ -1095,7 +1096,7 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
 }
 
-static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
+static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
 					 struct hashfile *out,
 					 off_t pack_start UNUSED,
 					 struct pack_window **w_curs)
@@ -1110,13 +1111,13 @@ static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
 		off_t to_write;
 
 		written = (pos * BITS_IN_EWORD);
-		to_write = pack_pos_to_offset(reuse_packfile, written)
+		to_write = pack_pos_to_offset(reuse_packfile->p, written)
 			- sizeof(struct pack_header);
 
 		/* We're recording one chunk, not one object. */
 		record_reused_object(sizeof(struct pack_header), 0);
 		hashflush(out);
-		copy_pack_data(out, reuse_packfile, w_curs,
+		copy_pack_data(out, reuse_packfile->p, w_curs,
 			sizeof(struct pack_header), to_write);
 
 		display_progress(progress_state, written);
@@ -1124,7 +1125,7 @@ static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
 	return pos;
 }
 
-static void write_reused_pack(struct packed_git *reuse_packfile,
+static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
 			      struct hashfile *f)
 {
 	size_t i = 0;
@@ -1150,8 +1151,8 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
 			 * bitmaps. See comment in try_partial_reuse()
 			 * for why.
 			 */
-			write_reused_pack_one(reuse_packfile, pos + offset, f,
-					      pack_start, &w_curs);
+			write_reused_pack_one(reuse_packfile->p, pos + offset,
+					      f, pack_start, &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
@@ -1207,9 +1208,12 @@ static void write_pack_file(void)
 
 		offset = write_pack_header(f, nr_remaining);
 
-		if (reuse_packfile) {
+		if (reuse_packfiles_nr) {
 			assert(pack_to_stdout);
-			write_reused_pack(reuse_packfile, f);
+			for (j = 0; j < reuse_packfiles_nr; j++) {
+				reused_chunks_nr = 0;
+				write_reused_pack(&reuse_packfiles[j], f);
+			}
 			offset = hashfile_total(f);
 		}
 
@@ -3952,19 +3956,16 @@ static int pack_options_allow_reuse(void)
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
 {
-	struct bitmapped_pack *packs = NULL;
-	size_t packs_nr = 0;
-
 	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
 		return -1;
 
 	if (pack_options_allow_reuse())
-		reuse_partial_packfile_from_bitmap(bitmap_git, &packs,
-						   &packs_nr,
+		reuse_partial_packfile_from_bitmap(bitmap_git,
+						   &reuse_packfiles,
+						   &reuse_packfiles_nr,
 						   &reuse_packfile_bitmap);
 
-	if (packs) {
-		reuse_packfile = packs[0].p;
+	if (reuse_packfiles) {
 		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
 		if (!reuse_packfile_objects)
 			BUG("expected non-empty reuse bitmap");
-- 
2.43.0.24.g980b318f98



* [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (14 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 15/24] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-12-07 13:13   ` Patrick Steinhardt
  2023-11-28 19:08 ` [PATCH 17/24] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
                   ` (10 subsequent siblings)
  26 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The function `write_reused_pack()` within `builtin/pack-objects.c` is
responsible for performing pack-reuse on a single pack, and has two main
functions:

  - it dispatches a call to `write_reused_pack_verbatim()` to see if we
    can reuse portions of the packfile in whole-word chunks

  - for any remaining objects (that is, any objects that appear after
    the first "gap" in the bitmap), call write_reused_pack_one() on that
    object to record it for reuse.

Prepare this function for multi-pack reuse by removing the assumption
that the bit position corresponding to the first object being reused
from a given pack is always at bit position zero.

The changes in this function are mostly straightforward. Initialize `i`
to the position of the first word to contain bits corresponding to that
reuse pack. In most situations, we throw the initialized value away,
since we end up replacing it with the return value from
write_reused_pack_verbatim(), moving us past the section of whole words
that we reused.

Likewise, modify the per-object loop to ignore any bits at the beginning
of the first word that do not belong to the pack currently being reused,
as well as skip to the "done" section once we have processed the last
bit corresponding to this pack.
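
The per-object loop described above can be sketched in isolation as
follows. The function and macro names are hypothetical, and the sketch
tests bits one at a time instead of skipping ahead with a
count-trailing-zeros helper; it only illustrates how bits outside
`[bitmap_pos, bitmap_pos + bitmap_nr)` are ignored and how bitmap
positions are translated into pack-relative positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BITS_IN_WORD 64

/*
 * Collect the pack-relative position of every set bit that belongs to
 * the pack occupying bits [bitmap_pos, bitmap_pos + bitmap_nr).
 * Returns the number of positions stored in `out`.
 */
static size_t toy_collect_pack_bits(const uint64_t *words, size_t word_alloc,
				    size_t bitmap_pos, size_t bitmap_nr,
				    size_t *out, size_t out_alloc)
{
	/* Start at the first word containing bits of this pack. */
	size_t i = bitmap_pos / TOY_BITS_IN_WORD;
	size_t nr = 0;

	for (; i < word_alloc; i++) {
		uint64_t word = words[i];
		size_t pos = i * TOY_BITS_IN_WORD;
		size_t offset;

		for (offset = 0; offset < TOY_BITS_IN_WORD; offset++) {
			if (!((word >> offset) & 1))
				continue;
			if (pos + offset < bitmap_pos)
				continue; /* bit belongs to an earlier pack */
			if (pos + offset >= bitmap_pos + bitmap_nr)
				return nr; /* past the end of this pack */
			assert(nr < out_alloc);
			out[nr++] = pos + offset - bitmap_pos;
		}
	}
	return nr;
}
```

Subtracting `bitmap_pos` is what lets the per-object writer keep
working with positions relative to a single pack.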

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3b7704d062..b5e6f6377a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1128,7 +1128,7 @@ static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
 static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
 			      struct hashfile *f)
 {
-	size_t i = 0;
+	size_t i = reuse_packfile->bitmap_pos / BITS_IN_EWORD;
 	uint32_t offset;
 	off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);
 	struct pack_window *w_curs = NULL;
@@ -1146,17 +1146,23 @@ static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
+			if (pos + offset < reuse_packfile->bitmap_pos)
+				continue;
+			if (pos + offset >= reuse_packfile->bitmap_pos + reuse_packfile->bitmap_nr)
+				goto done;
 			/*
 			 * Can use bit positions directly, even for MIDX
 			 * bitmaps. See comment in try_partial_reuse()
 			 * for why.
 			 */
-			write_reused_pack_one(reuse_packfile->p, pos + offset,
+			write_reused_pack_one(reuse_packfile->p,
+					      pos + offset - reuse_packfile->bitmap_pos,
 					      f, pack_start, &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
 
+done:
 	unuse_pack(&w_curs);
 }
 
-- 
2.43.0.24.g980b318f98



* [PATCH 17/24] pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (15 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 18/24] pack-objects: include number of packs reused in output Taylor Blau
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The function `write_reused_pack_verbatim()` within
`builtin/pack-objects.c` is responsible for writing out a continuous
set of objects beginning at the start of the reuse packfile.

In the existing implementation, we did something like:

    while (pos < reuse_packfile_bitmap->word_alloc &&
           reuse_packfile_bitmap->words[pos] == (eword_t)~0)
      pos++;

    if (pos)
      /* write first `pos * BITS_IN_EWORD` objects from pack */

as an optimization to record a single chunk for the longest continuous
prefix of objects wanted out of the reuse pack, instead of having a
chunk for each individual object. For more details, see bb514de356
(pack-objects: improve partial packfile reuse, 2019-12-18).

In order to retain this optimization in a multi-pack reuse world, we can
no longer assume that the first object in a pack is on a word boundary
in the bitmap storing the set of reusable objects.

Assuming that all objects from the beginning of the reuse packfile up to
the object corresponding to the first bit on a word boundary are part of
the result, consume whole words at a time until the last whole word
belonging to the reuse packfile. Copy those objects to the resulting
packfile, and track that we reused them by recording a single chunk.
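
As a rough sketch of the boundary arithmetic only (assuming a 64-bit
`eword_t`; the function name `whole_word_range` is hypothetical), the
range of whole words lying entirely inside a pack's bit positions could
be computed like so. The patch additionally requires the leading partial
word and each whole word to be fully set, which this sketch does not
check.

```c
#include <stddef.h>

#define BITS_IN_EWORD 64 /* assumed: eword_t is a 64-bit word */

/*
 * Given a pack occupying bits [bitmap_pos, bitmap_pos + bitmap_nr),
 * compute the range of bit positions [*first, *last) covered by whole
 * words lying entirely inside the pack. Returns 0 (leaving the
 * outputs untouched) if there is no such word, and 1 otherwise.
 */
int whole_word_range(size_t bitmap_pos, size_t bitmap_nr,
		     size_t *first, size_t *last)
{
	size_t start = bitmap_pos;
	size_t end = bitmap_pos + bitmap_nr;

	/* round the start up to the next word boundary */
	if (start % BITS_IN_EWORD)
		start += BITS_IN_EWORD - (start % BITS_IN_EWORD);
	/* round the end down to the previous word boundary */
	end -= end % BITS_IN_EWORD;

	if (start >= end)
		return 0;
	*first = start;
	*last = end;
	return 1;
}
```

For example, a pack starting at bit 10 with 200 objects covers bits
[10, 210), so only the two whole words spanning bits [64, 192) are
candidates for the single-chunk copy.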

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 73 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 60 insertions(+), 13 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b5e6f6377a..e37509568b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1098,31 +1098,78 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 
 static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
 					 struct hashfile *out,
-					 off_t pack_start UNUSED,
+					 off_t pack_start,
 					 struct pack_window **w_curs)
 {
-	size_t pos = 0;
+	size_t pos = reuse_packfile->bitmap_pos;
+	size_t end;
 
-	while (pos < reuse_packfile_bitmap->word_alloc &&
-			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
-		pos++;
+	if (pos % BITS_IN_EWORD) {
+		size_t word_pos = (pos / BITS_IN_EWORD);
+		size_t offset = pos % BITS_IN_EWORD;
+		size_t last;
+		eword_t word = reuse_packfile_bitmap->words[word_pos];
 
-	if (pos) {
-		off_t to_write;
+		if (offset + reuse_packfile->bitmap_nr < BITS_IN_EWORD)
+			last = offset + reuse_packfile->bitmap_nr;
+		else
+			last = BITS_IN_EWORD;
 
-		written = (pos * BITS_IN_EWORD);
-		to_write = pack_pos_to_offset(reuse_packfile->p, written)
-			- sizeof(struct pack_header);
+		for (; offset < last; offset++) {
+			if (word >> offset == 0)
+				return word_pos;
+			if (!bitmap_get(reuse_packfile_bitmap,
+					word_pos * BITS_IN_EWORD + offset))
+				return word_pos;
+		}
+
+		pos += BITS_IN_EWORD - (pos % BITS_IN_EWORD);
+	}
+
+	/*
+	 * Now we're going to copy as many whole eword_t's as possible.
+	 * "end" is the index of the last whole eword_t we copy, but
+	 * there may be additional bits to process. Those are handled
+	 * individually by write_reused_pack().
+	 *
+	 * Begin by advancing to the first word boundary in range of the
+	 * bit positions occupied by objects in "reuse_packfile". Then
+	 * pick the last word boundary in the same range. If we have at
+	 * least one word's worth of bits to process, continue on.
+	 */
+	end = reuse_packfile->bitmap_pos + reuse_packfile->bitmap_nr;
+	if (end % BITS_IN_EWORD)
+		end -= end % BITS_IN_EWORD;
+	if (pos >= end)
+		return reuse_packfile->bitmap_pos / BITS_IN_EWORD;
+
+	while (pos < end &&
+	       reuse_packfile_bitmap->words[pos / BITS_IN_EWORD] == (eword_t)~0)
+		pos += BITS_IN_EWORD;
+
+	if (pos > end)
+		pos = end;
+
+	if (reuse_packfile->bitmap_pos < pos) {
+		off_t pack_start_off = pack_pos_to_offset(reuse_packfile->p, 0);
+		off_t pack_end_off = pack_pos_to_offset(reuse_packfile->p,
+							pos - reuse_packfile->bitmap_pos);
+
+		written += pos - reuse_packfile->bitmap_pos;
 
 		/* We're recording one chunk, not one object. */
-		record_reused_object(sizeof(struct pack_header), 0);
+		record_reused_object(pack_start_off,
+				     pack_start_off - (hashfile_total(out) - pack_start));
 		hashflush(out);
 		copy_pack_data(out, reuse_packfile->p, w_curs,
-			sizeof(struct pack_header), to_write);
+			pack_start_off, pack_end_off - pack_start_off);
 
 		display_progress(progress_state, written);
 	}
-	return pos;
+	if (pos % BITS_IN_EWORD)
+		BUG("attempted to jump past a word boundary to %"PRIuMAX,
+		    (uintmax_t)pos);
+	return pos / BITS_IN_EWORD;
 }
 
 static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
-- 
2.43.0.24.g980b318f98



* [PATCH 18/24] pack-objects: include number of packs reused in output
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (16 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 17/24] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 19/24] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

In addition to including the number of objects reused verbatim from a
reuse-pack, include the number of packs from which objects were reused.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e37509568b..902e70abc5 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -224,6 +224,7 @@ static struct progress *progress_state;
 
 static struct bitmapped_pack *reuse_packfiles;
 static size_t reuse_packfiles_nr;
+static size_t reuse_packfiles_used_nr;
 static uint32_t reuse_packfile_objects;
 static struct bitmap *reuse_packfile_bitmap;
 
@@ -1266,6 +1267,8 @@ static void write_pack_file(void)
 			for (j = 0; j < reuse_packfiles_nr; j++) {
 				reused_chunks_nr = 0;
 				write_reused_pack(&reuse_packfiles[j], f);
+				if (reused_chunks_nr)
+					reuse_packfiles_used_nr++;
 			}
 			offset = hashfile_total(f);
 		}
@@ -4612,9 +4615,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		fprintf_ln(stderr,
 			   _("Total %"PRIu32" (delta %"PRIu32"),"
 			     " reused %"PRIu32" (delta %"PRIu32"),"
-			     " pack-reused %"PRIu32),
+			     " pack-reused %"PRIu32" (from %"PRIuMAX")"),
 			   written, written_delta, reused, reused_delta,
-			   reuse_packfile_objects);
+			   reuse_packfile_objects,
+			   (uintmax_t)reuse_packfiles_used_nr);
 
 cleanup:
 	free_packing_data(&to_pack);
-- 
2.43.0.24.g980b318f98



* [PATCH 19/24] pack-bitmap: prepare to mark objects from multiple packs for reuse
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (17 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 18/24] pack-objects: include number of packs reused in output Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 20/24] pack-objects: add tracing for various packfile metrics Taylor Blau
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Now that the pack-objects code is equipped to handle reusing objects
from multiple packs, prepare the pack-bitmap code to mark objects from
multiple packs as reuse candidates.

In order to prepare the pack-bitmap code for this change, remove the
same set of assumptions we unwound in previous commits from the helper
function `reuse_partial_packfile_from_bitmap_1()`, in preparation for it
to be called in a loop over the set of disjoint packs in a following
commit.

Specifically, remove the assumption that the bit position corresponding
to the first object in a given reuse pack candidate is at a word
boundary. Like in previous commits, we have to walk up to the first word
boundary before marking whole words at a time for reuse. Unlike in
previous commits, however, we have to keep track of whether all of the
objects in the run-up to the first word boundary are wanted in the
resulting pack. This is because we cannot blindly reuse whole words at a
time unless we know for certain that we are sending all bases for any
objects stored as deltas within each word.

Once we're on a word boundary (provided that we want a complete prefix
of objects from the pack), we can then reuse the same "whole-words"
optimization from previous patches, marking all of the bits in a single
word at a time.

Any remaining objects (either from a partial word corresponding to
objects at the end of a pack, or starting from the middle of the pack if
we do not have a complete prefix) are dealt with individually via
try_partial_reuse(), which (among other things) ensures that we are
sending the necessary bases for all objects packed as OFS_DELTA or
REF_DELTA.
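
To make the "complete prefix" condition concrete, here is a minimal
sketch (not the patch's code) of checking the run-up to the first word
boundary. It assumes a 64-bit word and uses a plain `uint64_t` array in
place of `struct bitmap`; the name `leading_bits_all_set` is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BITS_IN_EWORD 64 /* assumed: eword_t is a 64-bit word */

/*
 * Return 1 if every bit from bitmap_pos up to the next word boundary
 * (or the end of the pack, whichever comes first) is set in "words".
 * If bitmap_pos is already on a word boundary there is no run-up to
 * check, so the answer is trivially 1.
 */
int leading_bits_all_set(const uint64_t *words, size_t bitmap_pos,
			 size_t bitmap_nr)
{
	size_t word_pos = bitmap_pos / BITS_IN_EWORD;
	size_t offset = bitmap_pos % BITS_IN_EWORD;
	size_t last;

	if (!offset)
		return 1;

	if (bitmap_nr < BITS_IN_EWORD - offset)
		last = offset + bitmap_nr;
	else
		last = BITS_IN_EWORD;

	for (; offset < last; offset++)
		if (!((words[word_pos] >> offset) & 1))
			return 0;
	return 1;
}
```

Only when a check of this kind succeeds is it safe to mark whole words
at a time, since a missing object in the run-up could be the base of a
delta stored later in the pack.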

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 113 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 93 insertions(+), 20 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 670deec909..be53fc6da5 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1940,36 +1940,109 @@ static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
 {
 	struct bitmap *result = bitmap_git->result;
 	struct pack_window *w_curs = NULL;
-	size_t i = 0;
+	size_t pos, offset;
+	unsigned complete_prefix = 1;
 
-	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
-		i++;
+	pos = pack->bitmap_pos / BITS_IN_EWORD;
+	offset = pack->bitmap_pos % BITS_IN_EWORD;
 
 	/*
-	 * Don't mark objects not in the packfile or preferred pack. This bitmap
-	 * marks objects eligible for reuse, but the pack-reuse code only
-	 * understands how to reuse a single pack. Since the preferred pack is
-	 * guaranteed to have all bases for its deltas (in a multi-pack bitmap),
-	 * we use it instead of another pack. In single-pack bitmaps, the choice
-	 * is made for us.
+	 * If the position of our first object is not on a word
+	 * boundary, check all bits individually until we reach the
+	 * first word boundary.
+	 *
+	 * If no bits are missing between pack->bitmap_pos and the next
+	 * word boundary, then we can move by whole words instead of by
+	 * individual objects. If one or more of the objects are missing
+	 * in that range, we must evaluate each subsequent object
+	 * individually in order to exclude deltas whose base we are not
+	 * sending, etc.
 	 */
-	if (i > pack->p->num_objects / BITS_IN_EWORD)
-		i = pack->p->num_objects / BITS_IN_EWORD;
+	if (offset) {
+		/*
+		 * Scan to the next word boundary, or through the last
+		 * object in this bitmap, whichever occurs earlier.
+		 */
+		size_t last;
+		eword_t word = result->words[pos];
+		if (pack->bitmap_nr < BITS_IN_EWORD - offset)
+			last = offset + pack->bitmap_nr;
+		else
+			last = BITS_IN_EWORD;
 
-	memset(reuse->words, 0xFF, i * sizeof(eword_t));
+		for (; offset < last; offset++) {
+			size_t pack_pos;
+			if (word >> offset == 0) {
+				complete_prefix = 0;
+				continue;
+			}
 
-	for (; i < result->word_alloc; ++i) {
-		eword_t word = result->words[i];
-		size_t pos = (i * BITS_IN_EWORD);
-		size_t offset;
+			offset += ewah_bit_ctz64(word >> offset);
 
-		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
-			if ((word >> offset) == 0)
+			pack_pos = pos * BITS_IN_EWORD + offset;
+			pack_pos -= pack->bitmap_pos;
+
+			try_partial_reuse(pack, pack_pos, reuse, &w_curs);
+			if (!bitmap_get(reuse, pack_pos + pack->bitmap_pos))
+				complete_prefix = 0;
+		}
+
+		pos++;
+	}
+
+	if (complete_prefix) {
+		/*
+		 * If we are using all of the objects at the beginning
+		 * of this pack, we can safely reuse objects in eword_t
+		 * sized chunks, since we are guaranteed to send all
+		 * potential delta bases.
+		 *
+		 * Scan the nearest word boundaries within range of this
+		 * pack's bit positions. If the pack does not start on a
+		 * word boundary, skip to the next boundary, since we
+		 * have already checked above.
+		 */
+		size_t start = pos;
+		size_t word_end = (pack->bitmap_pos + pack->bitmap_nr) / BITS_IN_EWORD;
+		while (start <= pos && pos < word_end &&
+		       pos < result->word_alloc &&
+		       result->words[pos] == (eword_t)~0)
+			pos++;
+		memset(reuse->words + start, 0xFF, (pos - start) * sizeof(eword_t));
+	}
+
+	/*
+	 * At this point, we know that we are at an eword boundary,
+	 * either because:
+	 *
+	 *   - we started at one and used zero or more whole words
+	 *     following pack->bitmap_pos
+	 *
+	 *   - we started in between two word boundaries, advanced
+	 *     forward to the next word boundary, and then used zero or
+	 *     more (assuming a complete prefix) whole words following.
+	 */
+	for (; pos < result->word_alloc; pos++) {
+		eword_t word = result->words[pos];
+
+		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
+			size_t bit_pos, pack_pos;
+			if (word >> offset == 0)
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
-			if (try_partial_reuse(pack, pos + offset,
-					      reuse, &w_curs) < 0) {
+
+			bit_pos = pos * BITS_IN_EWORD + offset;
+			if (bit_pos >= pack->bitmap_pos + pack->bitmap_nr)
+				goto done;
+
+			pack_pos = bit_pos - pack->bitmap_pos;
+			if (pack_pos >= pack->p->num_objects)
+				BUG("advanced beyond the end of pack %s (%"PRIuMAX" > %"PRIu32")",
+				    pack_basename(pack->p), (uintmax_t)pack_pos,
+				    pack->p->num_objects);
+
+			if (try_partial_reuse(pack, pack_pos, reuse, &w_curs) < 0) {
 				/*
 				 * try_partial_reuse indicated we couldn't reuse
 				 * any bits, so there is no point in trying more
-- 
2.43.0.24.g980b318f98



* [PATCH 20/24] pack-objects: add tracing for various packfile metrics
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (18 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 19/24] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 21/24] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

As part of the multi-pack reuse effort, we will want to add some tests
that assert that we reused a certain number of objects from a certain
number of packs.

We could do this by grepping through the stderr output of
`pack-objects`, but doing so would be brittle in case the output format
changed.

Instead, let's use the trace2 mechanism to log various pieces of
information about the generated packfile, which we can then use to
compare against desired values.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 902e70abc5..fa71fe1ccf 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4620,6 +4620,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			   reuse_packfile_objects,
 			   (uintmax_t)reuse_packfiles_used_nr);
 
+	trace2_data_intmax("pack-objects", the_repository, "written", written);
+	trace2_data_intmax("pack-objects", the_repository, "written/delta", written_delta);
+	trace2_data_intmax("pack-objects", the_repository, "reused", reused);
+	trace2_data_intmax("pack-objects", the_repository, "reused/delta", reused_delta);
+	trace2_data_intmax("pack-objects", the_repository, "pack-reused", reuse_packfile_objects);
+	trace2_data_intmax("pack-objects", the_repository, "packs-reused", reuse_packfiles_used_nr);
+
 cleanup:
 	free_packing_data(&to_pack);
 	list_objects_filter_release(&filter_options);
-- 
2.43.0.24.g980b318f98



* [PATCH 21/24] t/test-lib-functions.sh: implement `test_trace2_data` helper
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (19 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 20/24] pack-objects: add tracing for various packfile metrics Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 22/24] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Introduce a helper function which looks for a specific (category, key,
value) tuple in the output of a trace2 event stream.

We will use this function in a future patch to ensure that the expected
number of objects are reused from an expected number of packs.
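
The helper itself is a shell one-liner around grep. Purely to spell out
what fragment it looks for, the same match can be sketched in C. The
function name `line_has_trace2_data` is hypothetical, and unlike grep
this sketch does a literal substring search rather than a regular
expression match.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if "line" contains the fragment
 *   "category":"<category>","key":"<key>","value":"<value>"
 * and 0 otherwise (including when the fragment does not fit in the
 * fixed-size buffer used by this sketch).
 */
int line_has_trace2_data(const char *line, const char *category,
			 const char *key, const char *value)
{
	char needle[256];
	int len = snprintf(needle, sizeof(needle),
			   "\"category\":\"%s\",\"key\":\"%s\",\"value\":\"%s\"",
			   category, key, value);

	if (len < 0 || (size_t)len >= sizeof(needle))
		return 0;
	return strstr(line, needle) != NULL;
}
```

Because the closing quote of the value is part of the fragment, looking
for a value of "3" does not match a line whose value is "30".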

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/test-lib-functions.sh | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 9c3cf12b26..93fe819b0a 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1874,6 +1874,20 @@ test_region () {
 	return 0
 }
 
+# Check that the given data fragment was included as part of the
+# trace2-format trace on stdin.
+#
+#	test_trace2_data <category> <key> <value>
+#
+# For example, to look for trace2_data_intmax("pack-objects", repo,
+# "reused", N) in an invocation of "git pack-objects", run:
+#
+#	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git pack-objects ... &&
+#	test_trace2_data pack-objects reused N <trace2.txt
+test_trace2_data () {
+	grep -e '"category":"'"$1"'","key":"'"$2"'","value":"'"$3"'"'
+}
+
 # Given a GIT_TRACE2_EVENT log over stdin, writes to stdout a list of URLs
 # sent to git-remote-https child processes.
 test_remote_https_urls() {
-- 
2.43.0.24.g980b318f98



* [PATCH 22/24] pack-objects: allow setting `pack.allowPackReuse` to "single"
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (20 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 21/24] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 23/24] pack-bitmap: reuse objects from all disjoint packs Taylor Blau
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

In e704fc7978 (pack-objects: introduce pack.allowPackReuse, 2019-12-18),
the `pack.allowPackReuse` configuration option was introduced, allowing
users to disable the pack reuse mechanism.

To prepare for debugging multi-pack reuse, allow setting this
configuration option to "single" in addition to the usual bool-or-int
values.

"single" implies the same behavior as "true", "1", "yes", and so on. But
it will complement a new "multi" value (to be introduced in a future
commit). When set to "single", we will only perform pack reuse on a
single pack, regardless of whether or not there are multiple disjoint
packs.

This requires no code changes (yet), since we only support single pack
reuse.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt |  2 +-
 builtin/pack-objects.c        | 19 ++++++++++++++++---
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index f50df9dbce..fe100d0fb7 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -28,7 +28,7 @@ all existing objects. You can force recompression by passing the -F option
 to linkgit:git-repack[1].
 
 pack.allowPackReuse::
-	When true, and when reachability bitmaps are enabled,
+	When true or "single", and when reachability bitmaps are enabled,
 	pack-objects will try to send parts of the bitmapped packfile
 	verbatim. This can reduce memory and CPU usage to serve fetches,
 	but might result in sending a slightly larger pack. Defaults to
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index fa71fe1ccf..4853e91251 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -230,7 +230,10 @@ static struct bitmap *reuse_packfile_bitmap;
 
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
-static int allow_pack_reuse = 1;
+static enum {
+	NO_PACK_REUSE = 0,
+	SINGLE_PACK_REUSE,
+} allow_pack_reuse = SINGLE_PACK_REUSE;
 static enum {
 	WRITE_BITMAP_FALSE = 0,
 	WRITE_BITMAP_QUIET,
@@ -3246,7 +3249,17 @@ static int git_pack_config(const char *k, const char *v,
 		return 0;
 	}
 	if (!strcmp(k, "pack.allowpackreuse")) {
-		allow_pack_reuse = git_config_bool(k, v);
+		int res = git_parse_maybe_bool_text(v);
+		if (res < 0) {
+			if (!strcasecmp(v, "single"))
+				allow_pack_reuse = SINGLE_PACK_REUSE;
+			else
+				die(_("invalid pack.allowPackReuse value: '%s'"), v);
+		} else if (res) {
+			allow_pack_reuse = SINGLE_PACK_REUSE;
+		} else {
+			allow_pack_reuse = NO_PACK_REUSE;
+		}
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -4002,7 +4015,7 @@ static void loosen_unused_packed_objects(void)
  */
 static int pack_options_allow_reuse(void)
 {
-	return allow_pack_reuse &&
+	return allow_pack_reuse != NO_PACK_REUSE &&
 	       pack_to_stdout &&
 	       !ignore_packed_keep_on_disk &&
 	       !ignore_packed_keep_in_core &&
-- 
2.43.0.24.g980b318f98



* [PATCH 23/24] pack-bitmap: reuse objects from all disjoint packs
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (21 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 22/24] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-28 19:08 ` [PATCH 24/24] t/perf: add performance tests for multi-pack reuse Taylor Blau
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Now that both the pack-bitmap and pack-objects code are prepared to
handle marking and using objects from multiple disjoint packs for
verbatim reuse, allow marking objects from all disjoint packs as
eligible for reuse.

Within the `reuse_partial_packfile_from_bitmap()` function, we no longer
only mark the pack whose first object is at bit position zero for reuse,
and instead mark any pack which is flagged as disjoint by the MIDX as a
reuse candidate. If no such packs exist (i.e., because we are reading a
MIDX written before the "DISP" chunk was introduced), then treat the
preferred pack as disjoint for the purposes of reuse. This is a safe
assumption to make since all duplicate objects are resolved in favor of
the preferred pack.
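
The selection rule can be summarized with the following simplified
sketch. It is not the code in this patch: `struct pack_info` and
`select_reuse_packs` are hypothetical stand-ins for
`struct bitmapped_pack` and the loop in
`reuse_partial_packfile_from_bitmap()`, and the real code also handles
errors and memory allocation.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct bitmapped_pack. */
struct pack_info {
	size_t bitmap_pos; /* first bit position of this pack */
	int disjoint;      /* flagged as disjoint by the MIDX? */
};

/*
 * Write the indexes of the packs chosen for reuse into "out" and
 * return how many were chosen. "out" must have room for at least
 * "nr" entries, and for at least one entry. "preferred" is the index
 * of the preferred pack, used as a fallback when no pack qualifies.
 */
size_t select_reuse_packs(const struct pack_info *packs, size_t nr,
			  size_t preferred, int multi_pack_reuse,
			  size_t *out)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++) {
		if (!packs[i].disjoint)
			continue;
		/* single-pack mode: only consider the pack at bit 0 */
		if (!multi_pack_reuse && packs[i].bitmap_pos)
			continue;
		out[n++] = i;
		if (!multi_pack_reuse)
			break;
	}

	/* nothing selected: treat the preferred pack as disjoint */
	if (!n)
		out[n++] = preferred;
	return n;
}
```

Note that the fallback applies whenever the loop selects nothing, which
covers both old MIDXs without a "DISP" chunk and single-pack mode when
the pack at bit position zero is not marked disjoint.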

Provide a handful of test cases in a new script (t5332) exercising
interesting behavior for multi-pack reuse to ensure that we performed
all of the previous steps correctly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt |   6 +-
 builtin/pack-objects.c        |   6 +-
 pack-bitmap.c                 |  73 +++++++++---
 pack-bitmap.h                 |   3 +-
 t/t5332-multi-pack-reuse.sh   | 219 ++++++++++++++++++++++++++++++++++
 5 files changed, 290 insertions(+), 17 deletions(-)
 create mode 100755 t/t5332-multi-pack-reuse.sh

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index fe100d0fb7..9fe48d41c9 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -30,7 +30,11 @@ to linkgit:git-repack[1].
 pack.allowPackReuse::
 	When true or "single", and when reachability bitmaps are enabled,
 	pack-objects will try to send parts of the bitmapped packfile
-	verbatim. This can reduce memory and CPU usage to serve fetches,
+	verbatim. When "multi", and when a multi-pack reachability bitmap is
+	available, pack-objects will try to send parts of all packs marked as
+	disjoint by the MIDX. If only a single pack bitmap is available, and
+	`pack.allowPackReuse` is set to "multi", reuse parts of just the
+	bitmapped packfile. This can reduce memory and CPU usage to serve fetches,
 	but might result in sending a slightly larger pack. Defaults to
 	true.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4853e91251..43b77bff7c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -233,6 +233,7 @@ static int use_bitmap_index = -1;
 static enum {
 	NO_PACK_REUSE = 0,
 	SINGLE_PACK_REUSE,
+	MULTI_PACK_REUSE,
 } allow_pack_reuse = SINGLE_PACK_REUSE;
 static enum {
 	WRITE_BITMAP_FALSE = 0,
@@ -3253,6 +3254,8 @@ static int git_pack_config(const char *k, const char *v,
 		if (res < 0) {
 			if (!strcasecmp(v, "single"))
 				allow_pack_reuse = SINGLE_PACK_REUSE;
+			else if (!strcasecmp(v, "multi"))
+				allow_pack_reuse = MULTI_PACK_REUSE;
 			else
 				die(_("invalid pack.allowPackReuse value: '%s'"), v);
 		} else if (res) {
@@ -4032,7 +4035,8 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 		reuse_partial_packfile_from_bitmap(bitmap_git,
 						   &reuse_packfiles,
 						   &reuse_packfiles_nr,
-						   &reuse_packfile_bitmap);
+						   &reuse_packfile_bitmap,
+						   allow_pack_reuse == MULTI_PACK_REUSE);
 
 	if (reuse_packfiles) {
 		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index be53fc6da5..561690c679 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2061,10 +2061,19 @@ static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
 	unuse_pack(&w_curs);
 }
 
+static void make_disjoint_pack(struct bitmapped_pack *out, struct packed_git *p)
+{
+	out->p = p;
+	out->bitmap_pos = 0;
+	out->bitmap_nr = p->num_objects;
+	out->disjoint = 1;
+}
+
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 					struct bitmapped_pack **packs_out,
 					size_t *packs_nr_out,
-					struct bitmap **reuse_out)
+					struct bitmap **reuse_out,
+					int multi_pack_reuse)
 {
 	struct repository *r = the_repository;
 	struct bitmapped_pack *packs = NULL;
@@ -2088,24 +2097,62 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				free(packs);
 				return;
 			}
-			if (!pack.bitmap_nr)
-				continue; /* no objects from this pack */
-			if (pack.bitmap_pos)
-				continue; /* not preferred pack */
+
+			if (!pack.disjoint)
+				continue;
+
+			if (!multi_pack_reuse && pack.bitmap_pos) {
+				/*
+				 * If we're only reusing a single pack, skip
+				 * over any packs which are not positioned at
+				 * the beginning of the MIDX bitmap.
+				 *
+				 * This is consistent with the existing
+				 * single-pack reuse behavior, which only reuses
+				 * parts of the MIDX's preferred pack.
+				 */
+				continue;
+			}
 
 			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
 			memcpy(&packs[packs_nr++], &pack, sizeof(pack));
 
 			objects_nr += pack.p->num_objects;
+
+			if (!multi_pack_reuse)
+				break;
+		}
+
+		if (!packs_nr) {
+			/*
+			 * Old MIDXs (i.e. those written before the "DISP" chunk
+			 * existed) will not have any packs marked as disjoint.
+			 *
+			 * But we still want to perform pack reuse with the
+			 * special "preferred pack" as before. To do this, form
+			 * the singleton set containing just the preferred pack,
+			 * which is trivially disjoint with itself.
+			 *
+			 * Moreover, the MIDX is guaranteed to resolve duplicate
+			 * objects in favor of the copy in the preferred pack
+			 * (if one exists). Thus, we can safely perform pack
+			 * reuse on this pack.
+			 */
+			uint32_t preferred_pack_pos;
+			struct packed_git *preferred_pack;
+
+			preferred_pack_pos = midx_preferred_pack(bitmap_git);
+			preferred_pack = bitmap_git->midx->packs[preferred_pack_pos];
+
+			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+
+			make_disjoint_pack(&packs[packs_nr], preferred_pack);
+			objects_nr = packs[packs_nr++].p->num_objects;
 		}
 	} else {
 		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
 
-		packs[packs_nr].p = bitmap_git->pack;
-		packs[packs_nr].bitmap_pos = 0;
-		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
-		packs[packs_nr].disjoint = 1;
-
+		make_disjoint_pack(&packs[packs_nr], bitmap_git->pack);
 		objects_nr = packs[packs_nr++].p->num_objects;
 	}
 
@@ -2114,10 +2161,8 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 		word_alloc++;
 	reuse = bitmap_word_alloc(word_alloc);
 
-	if (packs_nr != 1)
-		BUG("pack reuse not yet implemented for multiple packs");
-
-	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
+	for (i = 0; i < packs_nr; i++)
+		reuse_partial_packfile_from_bitmap_1(bitmap_git, &packs[i], reuse);
 
 	if (!bitmap_popcount(reuse)) {
 		free(packs);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 901a3b86ed..8bb316ce52 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -81,7 +81,8 @@ uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 					struct bitmapped_pack **packs_out,
 					size_t *packs_nr_out,
-					struct bitmap **reuse_out);
+					struct bitmap **reuse_out,
+					int multi_pack_reuse);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
new file mode 100755
index 0000000000..a9bd3870e6
--- /dev/null
+++ b/t/t5332-multi-pack-reuse.sh
@@ -0,0 +1,219 @@
+#!/bin/sh
+
+test_description='pack-objects multi-pack reuse'
+
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-bitmap.sh
+. "$TEST_DIRECTORY"/lib-disjoint.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+all_packs () {
+	find $packdir -type f -name "*.idx" | sed -e 's/^.*\/\([^\/]\)/\1/g'
+}
+
+all_disjoint () {
+	all_packs | sed -e 's/^/+/g'
+}
+
+test_pack_reused () {
+	test_trace2_data pack-objects pack-reused "$1"
+}
+
+test_packs_reused () {
+	test_trace2_data pack-objects packs-reused "$1"
+}
+
+
+# pack_position <object> </path/to/pack.idx
+pack_position () {
+	git show-index >objects &&
+	grep "$1" objects | cut -d" " -f1
+}
+
+test_expect_success 'setup' '
+	git config pack.allowPackReuse multi
+'
+
+test_expect_success 'preferred pack is reused without packs marked disjoint' '
+	test_commit A &&
+	test_commit B &&
+
+	A="$(echo A | git pack-objects --unpacked --delta-base-offset $packdir/pack)" &&
+	B="$(echo B | git pack-objects --unpacked --delta-base-offset $packdir/pack)" &&
+
+	git prune-packed &&
+
+	git multi-pack-index write --bitmap &&
+
+	test_must_not_be_disjoint "pack-$A.pack" &&
+	test_must_not_be_disjoint "pack-$B.pack" &&
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --revs --all >/dev/null &&
+
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'reuse all objects from subset of disjoint packs' '
+	test_commit C &&
+
+	C="$(echo C | git pack-objects --unpacked --delta-base-offset $packdir/pack)" &&
+
+	git prune-packed &&
+
+	cat >in <<-EOF &&
+	pack-$A.idx
+	+pack-$B.idx
+	+pack-$C.idx
+	EOF
+	git multi-pack-index write --bitmap --stdin-packs <in &&
+
+	test_must_not_be_disjoint "pack-$A.pack" &&
+	test_must_be_disjoint "pack-$B.pack" &&
+	test_must_be_disjoint "pack-$C.pack" &&
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --revs --all >/dev/null &&
+
+	test_pack_reused 6 <trace2.txt &&
+	test_packs_reused 2 <trace2.txt
+'
+
+test_expect_success 'reuse all objects from all disjoint packs' '
+	rm -fr $packdir/multi-pack-index* &&
+
+	all_disjoint >in &&
+	git multi-pack-index write --bitmap --stdin-packs <in &&
+
+	test_must_be_disjoint "pack-$A.pack" &&
+	test_must_be_disjoint "pack-$B.pack" &&
+	test_must_be_disjoint "pack-$C.pack" &&
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --revs --all >/dev/null &&
+
+	test_pack_reused 9 <trace2.txt &&
+	test_packs_reused 3 <trace2.txt
+'
+
+test_expect_success 'reuse objects from first disjoint pack with middle gap' '
+	test_commit D &&
+	test_commit E &&
+	test_commit F &&
+
+	# Set "pack.window" to zero to ensure that we do not create any
+	# deltas, which could alter the amount of pack reuse we perform
+	# (if, e.g., we are not sending one or more bases).
+	D="$(git -c pack.window=0 pack-objects --all --unpacked $packdir/pack)" &&
+
+	d_pos="$(pack_position $(git rev-parse D) <$packdir/pack-$D.idx)" &&
+	e_pos="$(pack_position $(git rev-parse E) <$packdir/pack-$D.idx)" &&
+	f_pos="$(pack_position $(git rev-parse F) <$packdir/pack-$D.idx)" &&
+
+	# commits F, E, and D, should appear in that order at the
+	# beginning of the pack
+	test $f_pos -lt $e_pos &&
+	test $e_pos -lt $d_pos &&
+
+	# Ensure that the pack we are constructing sorts ahead of any
+	# other packs in lexical/bitmap order by choosing it as the
+	# preferred pack.
+	all_disjoint >in &&
+	git multi-pack-index write --bitmap --preferred-pack="pack-$D.idx" \
+		--stdin-packs <in &&
+
+	test_must_be_disjoint pack-$A.pack &&
+	test_must_be_disjoint pack-$B.pack &&
+	test_must_be_disjoint pack-$C.pack &&
+	test_must_be_disjoint pack-$D.pack &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse E)
+	^$(git rev-parse D)
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
+
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'reuse objects from middle disjoint pack with middle gap' '
+	rm -fr $packdir/multi-pack-index* &&
+
+	# Ensure that the pack we are constructing sorts into any
+	# position *but* the first one, by choosing a different pack as
+	# the preferred one.
+	all_disjoint >in &&
+	git multi-pack-index write --bitmap --preferred-pack="pack-$A.idx" \
+		--stdin-packs <in &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse E)
+	^$(git rev-parse D)
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
+
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'omit delta with uninteresting base' '
+	git repack -adk &&
+
+	test_seq 32 >f &&
+	git add f &&
+	test_tick &&
+	git commit -m "delta" &&
+	delta="$(git rev-parse HEAD)" &&
+
+	test_seq 64 >f &&
+	test_tick &&
+	git commit -a -m "base" &&
+	base="$(git rev-parse HEAD)" &&
+
+	test_commit other &&
+
+	git repack -d &&
+
+	have_delta "$(git rev-parse $delta:f)" "$(git rev-parse $base:f)" &&
+
+	all_disjoint >in &&
+	git multi-pack-index write --bitmap --stdin-packs <in &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse other)
+	^$base
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
+
+	# Even though all packs are marked disjoint, we can only reuse
+	# the 3 objects corresponding to "other" from the latest pack.
+	#
+	# This is because even though we want "delta", we do not want
+	# "base", meaning that we have to inflate the delta/base-pair
+	# corresponding to the blob in commit "delta", which bypasses
+	# the pack-reuse mechanism.
+	#
+	# The remaining objects from the other pack are similarly not
+	# reused because their objects are on the uninteresting side of
+	# the query.
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_done
-- 
2.43.0.24.g980b318f98



* [PATCH 24/24] t/perf: add performance tests for multi-pack reuse
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (22 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 23/24] pack-bitmap: reuse objects from all disjoint packs Taylor Blau
@ 2023-11-28 19:08 ` Taylor Blau
  2023-11-30 10:18 ` [PATCH 00/24] pack-objects: multi-pack verbatim reuse Patrick Steinhardt
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-28 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

To ensure that we don't regress either the size or runtime performance
of multi-pack reuse, add a performance test to measure both of these.

The test partitions the objects in GIT_TEST_PERF_LARGE_REPO into 1, 10,
and 100 packs, and then tries to perform a "clone" at each stage with
both single- and multi-pack reuse enabled.

Note that the `repack_into_n_chunks()` function in this new test script
differs from the existing `repack_into_n()`. The former partitions the
repository into N equal-sized chunks, while the latter produces N packs
of five commits each (plus their objects), and then another pack with
the remainder.
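
As a rough illustration of the partitioning rule (a standalone sketch, not
part of the test script; the function name `chunk_boundaries` is made up
for this example), the commits picked as chunk tips are those whose
1-based position in the `git rev-list --all` output is a multiple of
`total / N`:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given "total" commits and a requested number of chunks "n", write the
 * 1-based positions selected by the rule "position % (total / n) == 0"
 * into "out" and return how many were selected. "out" must have room
 * for "total" entries.
 */
static size_t chunk_boundaries(size_t total, size_t n, size_t *out)
{
	size_t sz, pos, nr = 0;

	assert(n > 0 && total >= n);
	sz = total / n;	/* integer division, as in shell $((...)) */
	for (pos = 1; pos <= total; pos++)
		if (pos % sz == 0)
			out[nr++] = pos;
	return nr;
}
```

With 10 commits and N=3, integer division gives a chunk size of 3, so
positions 3, 6, and 9 are selected; whatever those tips do not reach is
left for the final `pack-objects --all` step.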

On git.git, I can produce the following results on my machine:

    Test                                                            this tree
    --------------------------------------------------------------------------------
    5332.3: clone for 1-pack scenario (single-pack reuse)           1.57(2.99+0.15)
    5332.4: clone size for 1-pack scenario (single-pack reuse)               231.8M
    5332.5: clone for 1-pack scenario (multi-pack reuse)            1.79(2.96+0.21)
    5332.6: clone size for 1-pack scenario (multi-pack reuse)                231.7M
    5332.9: clone for 10-pack scenario (single-pack reuse)          3.89(16.75+0.35)
    5332.10: clone size for 10-pack scenario (single-pack reuse)             209.9M
    5332.11: clone for 10-pack scenario (multi-pack reuse)          1.56(2.99+0.17)
    5332.12: clone size for 10-pack scenario (multi-pack reuse)              224.4M
    5332.15: clone for 100-pack scenario (single-pack reuse)        8.24(54.31+0.59)
    5332.16: clone size for 100-pack scenario (single-pack reuse)            278.3M
    5332.17: clone for 100-pack scenario (multi-pack reuse)         2.13(2.44+0.33)
    5332.18: clone size for 100-pack scenario (multi-pack reuse)             357.9M
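
To put the table in relative terms, the ratios can be worked out as
below. This is only a small arithmetic sketch (the helper names are
invented for illustration); the inputs are the wall-clock times and
sizes from the table above.

```c
#include <assert.h>

/* How many times faster the "fast" run is compared to the "slow" run. */
static double speedup(double slow_seconds, double fast_seconds)
{
	assert(fast_seconds > 0);
	return slow_seconds / fast_seconds;
}

/* Relative size change from "base" to "other", in percent. */
static double size_change_percent(double base_mb, double other_mb)
{
	assert(base_mb > 0);
	return (other_mb - base_mb) / base_mb * 100.0;
}
```

Plugging in the 10-pack rows (3.89s vs. 1.56s, 209.9M vs. 224.4M) gives
roughly a 2.5x speed-up for about 7% more bytes; the 100-pack rows
(8.24s vs. 2.13s, 278.3M vs. 357.9M) give roughly 3.9x for about 29%
more. In the 1-pack scenario the two modes are close, with multi-pack
reuse slightly slower (1.79s vs. 1.57s).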

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5332-multi-pack-reuse.sh | 81 ++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100755 t/perf/p5332-multi-pack-reuse.sh

diff --git a/t/perf/p5332-multi-pack-reuse.sh b/t/perf/p5332-multi-pack-reuse.sh
new file mode 100755
index 0000000000..5c6c575d62
--- /dev/null
+++ b/t/perf/p5332-multi-pack-reuse.sh
@@ -0,0 +1,81 @@
+#!/bin/sh
+
+test_description='tests pack performance with multi-pack reuse'
+
+. ./perf-lib.sh
+. "${TEST_DIRECTORY}/perf/lib-pack.sh"
+
+packdir=.git/objects/pack
+
+test_perf_large_repo
+
+find_pack () {
+	for idx in $packdir/pack-*.idx
+	do
+		if git show-index <$idx | grep -q "$1"
+		then
+			basename $idx
+		fi || return 1
+	done
+}
+
+repack_into_n_chunks () {
+	git repack -adk &&
+
+	test "$1" -eq 1 && return ||
+
+	find $packdir -type f | sort >packs.before &&
+
+	# partition the repository into $1 chunks of consecutive commits, and
+	# then create $1 packs with the objects reachable from each chunk
+	# (excluding any objects reachable from the previous chunks)
+	sz="$(($(git rev-list --count --all) / $1))"
+	for rev in $(git rev-list --all | awk "NR % $sz == 0" | tac)
+	do
+		pack="$(echo "$rev" | git pack-objects --revs \
+			--honor-pack-keep --delta-base-offset $packdir/pack)" &&
+		touch $packdir/pack-$pack.keep || return 1
+	done
+
+	# grab any remaining objects not packed by the previous step(s)
+	git pack-objects --revs --all --honor-pack-keep --delta-base-offset \
+		$packdir/pack &&
+
+	find $packdir -type f | sort >packs.after &&
+
+	# and install the whole thing
+	for f in $(comm -12 packs.before packs.after)
+	do
+		rm -f "$f" || return 1
+	done
+	rm -fr $packdir/*.keep
+}
+
+for nr_packs in 1 10 100
+do
+	test_expect_success "create $nr_packs-pack scenario" '
+		repack_into_n_chunks $nr_packs
+	'
+
+	test_expect_success "setup bitmaps for $nr_packs-pack scenario" '
+		find $packdir -type f -name "*.idx" | sed -e "s/.*\/\(.*\)$/+\1/g" |
+		git multi-pack-index write --stdin-packs --bitmap \
+			--preferred-pack="$(find_pack $(git rev-parse HEAD))"
+	'
+
+	for reuse in single multi
+	do
+		test_perf "clone for $nr_packs-pack scenario ($reuse-pack reuse)" "
+			git for-each-ref --format='%(objectname)' refs/heads refs/tags >in &&
+			git -c pack.allowPackReuse=$reuse pack-objects \
+				--revs --delta-base-offset --use-bitmap-index \
+				--stdout <in >result
+		"
+
+		test_size "clone size for $nr_packs-pack scenario ($reuse-pack reuse)" '
+			wc -c <result
+		'
+	done
+done
+
+test_done
-- 
2.43.0.24.g980b318f98


* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (23 preceding siblings ...)
  2023-11-28 19:08 ` [PATCH 24/24] t/perf: add performance tests for multi-pack reuse Taylor Blau
@ 2023-11-30 10:18 ` Patrick Steinhardt
  2023-11-30 19:39   ` Taylor Blau
  2023-12-12  8:12 ` Jeff King
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
  26 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:07:54PM -0500, Taylor Blau wrote:
> Back in fff42755ef (pack-bitmap: add support for bitmap indexes,
> 2013-12-21), we added support for reachability bitmaps, and taught
> pack-objects how to reuse verbatim chunks from the bitmapped pack. When
> multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to
> use the MIDX's "preferred" pack as the source for verbatim reuse.
> 
> This allows repositories to incrementally repack themselves (e.g., using
> a `--geometric` repack), storing the result in a MIDX, and generating a
> corresponding bitmap. This keeps our bitmap coverage up-to-date, while
> maintaining a relatively small number of packs.
> 
> However, it is recommended (and matches what we do in production at
> GitHub) that repositories repack themselves all-into-one, and
> generate a corresponding single-pack reachability bitmap. This is done
> for a couple of reasons, but the most relevant one to this series is
> that it enables us to perform verbatim pack-reuse over a complete copy
> of the repository, since the entire repository resides in a single pack
> (and thus is eligible for verbatim pack-reuse).
> 
> As repositories grow larger, packing their contents into a single pack
> becomes less feasible. This series extends the pack-reuse mechanism to
> operate over multiple packs which are known ahead of time to be disjoint
> with respect to one another's set of objects.
> 
> The implementation has a few components:
> 
>   - A new MIDX chunk called "Disjoint packfiles" or DISP is introduced
>     to keep track of the bitmap position, number of objects, and
>     disjointed-ness for each pack contained in the MIDX.
> 
>   - A new mode for `git multi-pack-index write --stdin-packs` that
>     allows specifying disjoint packs, as well as a new option
>     `--retain-disjoint` which preserves the set of existing disjoint
>     packs in the new MIDX.
> 
>   - A new pack-objects mode `--ignore-disjoint`, which produces packs
>     which are disjoint with respect to the current set of disjoint packs
>     (i.e. it discards any objects from the packing list which appear in
>     any of the known-disjoint packs).
> 
>   - A new repack mode, `--extend-disjoint` which causes any new pack(s)
>     which are generated to be disjoint with respect to the set of packs
>     currently marked as disjoint, minus any pack(s) which are about to
>     be deleted.
> 
> With all of that in place, the patch series then rewrites all of the
> pack-reuse functions in terms of the new `bitmapped_pack` structure.
> Once we have dropped all of the assumptions stemming from only
> performing pack-reuse over a single candidate pack, we can then enable
> reuse over all of the disjoint packs.
> 
> In addition to the many new tests in t5332 added by this series, I tried
> to simulate a "real world" test on git.git by breaking the repository
> into chunks of 1,000 commits (plus their set of reachable objects not
> reachable from earlier chunk(s)) and packing those chunks. This produces
> a large number of packs with the objects from git.git which are known to
> be disjoint with respect to one another.
> 
>     $ git clone git@github.com:git/git.git base
> 
>     $ cd base
>     $ mv .git/objects/pack/pack-*.idx{,.bak}
>     $ git unpack-objects <.git/objects/pack/pack-*.pack
> 
>     # pack the objects from each successive block of 1k commits
>     $ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac)
>       do
>         echo $rev |
>         git.compile pack-objects --revs --unpacked .git/objects/pack/pack || return 1
>       done
>     # and grab any stragglers, pruning the unpacked objects
>     $ git repack -d
> 
> I then constructed a MIDX and corresponding bitmap:
> 
>     $ find_pack () {
>         for idx in .git/objects/pack/pack-*.idx
>         do
>           git show-index <$idx | grep -q "$1" && basename $idx
>         done
>       }
>     $ preferred="$(find_pack $(git rev-parse HEAD))"
> 
>     $ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' |
>         git.compile multi-pack-index write --bitmap --stdin-packs \
>           --preferred-pack=$preferred
>     $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
> 
> With all of that in place, I was able to produce a significant speed-up
> by reusing objects from multiple packs:
> 
>     $ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
>     Benchmark 1: single-pack reuse
>       Time (mean ± σ):      6.094 s ±  0.023 s    [User: 43.723 s, System: 0.358 s]
>       Range (min … max):    6.063 s …  6.126 s    10 runs
> 
>     Benchmark 2: multi-pack reuse
>       Time (mean ± σ):     906.5 ms ±   3.2 ms    [User: 1081.5 ms, System: 30.9 ms]
>       Range (min … max):   903.5 ms … 912.7 ms    10 runs
> 
>     Summary
>       multi-pack reuse ran
>         6.72 ± 0.03 times faster than single-pack reuse
> 
> (There are corresponding tests in p5332 that test different sized chunks
> and measure the runtime performance as well as resulting pack size).
> 
> Performing verbatim pack reuse naturally trades off between CPU time and
> the resulting pack size. In the above example, the single-pack reuse
> case produces a clone size of ~194 MB on my machine, while the
> multi-pack reuse case produces a clone size closer to ~266 MB, which is
> a ~37% increase in clone size.

Quite exciting, and a tradeoff that may be worth it for Git hosters. I
expect that this is going to be an extreme example of the benefits
provided by your patch series -- do you by any chance also have "real"
numbers that make it possible to quantify the effect a bit better?

No worry if you don't, I'm just curious.

> I think there is still some opportunity to close this gap, since the
> "packing" strategy here is extremely naive. In a production setting, I'm
> sure that there are more well thought out repacking strategies that
> would produce more similar clone sizes.
> 
> I considered breaking this series up into smaller chunks, but was
> unsatisfied with the result. Since this series is rather large, if you
> have alternate suggestions on better ways to structure this, please let
> me know.

The series is indeed very involved to review. I only made it up to patch
8/24 and already spent quite some time on it. So I'd certainly welcome
it if this was split up into smaller parts, but don't have a suggestion
as to how this should be done (also because I didn't yet read the other
16 patches).

I'll review the remaining patches at a later point in time.

Patrick

> Thanks in advance for your review!
> 
> Taylor Blau (24):
>   pack-objects: free packing_data in more places
>   pack-bitmap-write: deep-clear the `bb_commit` slab
>   pack-bitmap: plug leak in find_objects()
>   midx: factor out `fill_pack_info()`
>   midx: implement `DISP` chunk
>   midx: implement `midx_locate_pack()`
>   midx: implement `--retain-disjoint` mode
>   pack-objects: implement `--ignore-disjoint` mode
>   repack: implement `--extend-disjoint` mode
>   pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
>   pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
>   pack-bitmap: return multiple packs via
>     `reuse_partial_packfile_from_bitmap()`
>   pack-objects: parameterize pack-reuse routines over a single pack
>   pack-objects: keep track of `pack_start` for each reuse pack
>   pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
>   pack-objects: prepare `write_reused_pack()` for multi-pack reuse
>   pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
>     reuse
>   pack-objects: include number of packs reused in output
>   pack-bitmap: prepare to mark objects from multiple packs for reuse
>   pack-objects: add tracing for various packfile metrics
>   t/test-lib-functions.sh: implement `test_trace2_data` helper
>   pack-objects: allow setting `pack.allowPackReuse` to "single"
>   pack-bitmap: reuse objects from all disjoint packs
>   t/perf: add performance tests for multi-pack reuse
> 
>  Documentation/config/pack.txt          |   8 +-
>  Documentation/git-multi-pack-index.txt |  12 ++
>  Documentation/git-pack-objects.txt     |   8 +
>  Documentation/git-repack.txt           |  12 ++
>  Documentation/gitformat-pack.txt       | 109 ++++++++++
>  builtin/multi-pack-index.c             |  13 +-
>  builtin/pack-objects.c                 | 200 +++++++++++++++----
>  builtin/repack.c                       |  57 +++++-
>  midx.c                                 | 218 +++++++++++++++++---
>  midx.h                                 |  11 +-
>  pack-bitmap-write.c                    |   9 +-
>  pack-bitmap.c                          | 265 ++++++++++++++++++++-----
>  pack-bitmap.h                          |  18 +-
>  pack-objects.c                         |  15 ++
>  pack-objects.h                         |   1 +
>  t/helper/test-read-midx.c              |  31 ++-
>  t/lib-disjoint.sh                      |  49 +++++
>  t/perf/p5332-multi-pack-reuse.sh       |  81 ++++++++
>  t/t5319-multi-pack-index.sh            | 140 +++++++++++++
>  t/t5331-pack-objects-stdin.sh          | 156 +++++++++++++++
>  t/t5332-multi-pack-reuse.sh            | 219 ++++++++++++++++++++
>  t/t6113-rev-list-bitmap-filters.sh     |   2 +
>  t/t7700-repack.sh                      |   4 +-
>  t/t7705-repack-extend-disjoint.sh      | 142 +++++++++++++
>  t/test-lib-functions.sh                |  14 ++
>  25 files changed, 1650 insertions(+), 144 deletions(-)
>  create mode 100644 t/lib-disjoint.sh
>  create mode 100755 t/perf/p5332-multi-pack-reuse.sh
>  create mode 100755 t/t5332-multi-pack-reuse.sh
>  create mode 100755 t/t7705-repack-extend-disjoint.sh
> 
> 
> base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d
> -- 
> 2.43.0.24.g980b318f98


* Re: [PATCH 01/24] pack-objects: free packing_data in more places
  2023-11-28 19:07 ` [PATCH 01/24] pack-objects: free packing_data in more places Taylor Blau
@ 2023-11-30 10:18   ` Patrick Steinhardt
  2023-11-30 19:08     ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:07:57PM -0500, Taylor Blau wrote:
> The pack-objects internals use a packing_data struct to track what
> objects are part of the pack(s) being formed.
> 
> Since these structures contain allocated fields, failing to
> appropriately free() them results in a leak. Plug that leak by
> introducing a free_packing_data() function, and call it in the
> appropriate spots.
> 
> This is a fairly straightforward leak to plug, since none of the callers
> expect to read any values or have any references to parts of the address
> space being freed.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/pack-objects.c |  1 +
>  midx.c                 |  5 +++++
>  pack-objects.c         | 15 +++++++++++++++
>  pack-objects.h         |  1 +
>  4 files changed, 22 insertions(+)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 89a8b5a976..bfa60359d4 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -4522,6 +4522,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  			   reuse_packfile_objects);
>  
>  cleanup:
> +	free_packing_data(&to_pack);
>  	list_objects_filter_release(&filter_options);
>  	strvec_clear(&rp);
>  
> diff --git a/midx.c b/midx.c
> index 2f3863c936..3b727dc633 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -1592,8 +1592,13 @@ static int write_midx_internal(const char *object_dir,
>  				      flags) < 0) {
>  			error(_("could not write multi-pack bitmap"));
>  			result = 1;
> +			free_packing_data(&pdata);
> +			free(commits);
>  			goto cleanup;
>  		}
> +
> +		free_packing_data(&pdata);
> +		free(commits);
>  	}
>  	/*
>  	 * NOTE: Do not use ctx.entries beyond this point, since it might
> diff --git a/pack-objects.c b/pack-objects.c
> index f403ca6986..1c7bedcc94 100644
> --- a/pack-objects.c
> +++ b/pack-objects.c
> @@ -151,6 +151,21 @@ void prepare_packing_data(struct repository *r, struct packing_data *pdata)
>  	init_recursive_mutex(&pdata->odb_lock);
>  }
>  
> +void free_packing_data(struct packing_data *pdata)

Nit: shouldn't this rather be called `clear_packing_data`? `free` to me
indicates that the data structure itself will be free'd, as well, which
is not the case.
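
For illustration, the convention being referred to can be sketched as
follows. The `example_data` struct and its functions are hypothetical
and not taken from Git:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a struct that owns heap-allocated members. */
struct example_data {
	int *objects;
	size_t nr;
};

static void example_data_init(struct example_data *d, size_t nr)
{
	assert(d);
	d->objects = calloc(nr, sizeof(*d->objects));
	d->nr = d->objects ? nr : 0;
}

/* "clear": release what the struct owns, but not the struct itself. */
static void example_data_clear(struct example_data *d)
{
	if (!d)
		return;
	free(d->objects);
	d->objects = NULL;
	d->nr = 0;
}

/* "free": release the members and then the struct itself. */
static void example_data_free(struct example_data *d)
{
	example_data_clear(d);
	free(d);
}
```

A caller holding the struct on the stack (or embedded in another
struct) would use the `_clear` variant; only a heap-allocated struct
can be handed to the `_free` variant.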

Patrick

> +{
> +	if (!pdata)
> +		return;
> +
> +	free(pdata->cruft_mtime);
> +	free(pdata->in_pack);
> +	free(pdata->in_pack_by_idx);
> +	free(pdata->in_pack_pos);
> +	free(pdata->index);
> +	free(pdata->layer);
> +	free(pdata->objects);
> +	free(pdata->tree_depth);
> +}
> +
>  struct object_entry *packlist_alloc(struct packing_data *pdata,
>  				    const struct object_id *oid)
>  {
> diff --git a/pack-objects.h b/pack-objects.h
> index 0d78db40cb..336217e8cd 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -169,6 +169,7 @@ struct packing_data {
>  };
>  
>  void prepare_packing_data(struct repository *r, struct packing_data *pdata);
> +void free_packing_data(struct packing_data *pdata);
>  
>  /* Protect access to object database */
>  static inline void packing_data_lock(struct packing_data *pdata)
> -- 
> 2.43.0.24.g980b318f98
> 


* Re: [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab
  2023-11-28 19:07 ` [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
@ 2023-11-30 10:18   ` Patrick Steinhardt
  2023-11-30 19:11     ` Taylor Blau
  2023-12-12  7:04   ` Jeff King
  1 sibling, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:07:59PM -0500, Taylor Blau wrote:
> The `bb_commit` commit slab is used by the pack-bitmap-write machinery
> to track various pieces of bookkeeping used to generate reachability
> bitmaps.
> 
> Even though we clear the slab when freeing the bitmap_builder struct
> (with `bitmap_builder_clear()`), there are still pointers which point to
> locations in memory that have not yet been freed, resulting in a leak.
> 
> Plug the leak by introducing a suitable `free_fn` for the `struct
> bb_commit` type, and make sure it is called on each member of the slab
> via the `deep_clear_bb_data()` function.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  pack-bitmap-write.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
> index f4ecdf8b0e..dd3a415b9d 100644
> --- a/pack-bitmap-write.c
> +++ b/pack-bitmap-write.c
> @@ -198,6 +198,13 @@ struct bb_commit {
>  	unsigned idx; /* within selected array */
>  };
>  
> +static void clear_bb_commit(struct bb_commit *commit)
> +{
> +	free(commit->reverse_edges);

I'd have expected to see `free_commit_list()` here instead of a simple
free. Is there any reason why we don't use it?
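
As a generic illustration of the difference (the `item_list` type below
is a hypothetical stand-in, not Git's `struct commit_list`): calling
`free()` on the head of a singly linked list releases only that first
node, whereas a list-aware helper walks and releases every node.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical singly linked list, shaped like a list of commits. */
struct item_list {
	int item;
	struct item_list *next;
};

/* Prepend a node; returns the new head. */
static struct item_list *item_list_insert(int item, struct item_list *head)
{
	struct item_list *node = malloc(sizeof(*node));

	assert(node);
	node->item = item;
	node->next = head;
	return node;
}

/*
 * Walk the list and free every node. Returns the number of nodes
 * released so that the effect is observable.
 */
static size_t item_list_free(struct item_list *head)
{
	size_t freed = 0;

	while (head) {
		struct item_list *next = head->next;

		free(head);
		head = next;
		freed++;
	}
	return freed;
}
```

Whether a bare `free()` is enough therefore depends on whether the list
can ever hold more than one node at the point where it is released.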

Patrick

> +	bitmap_free(commit->commit_mask);
> +	bitmap_free(commit->bitmap);
> +}
> +
>  define_commit_slab(bb_data, struct bb_commit);
>  
>  struct bitmap_builder {
> @@ -339,7 +346,7 @@ static void bitmap_builder_init(struct bitmap_builder *bb,
>  
>  static void bitmap_builder_clear(struct bitmap_builder *bb)
>  {
> -	clear_bb_data(&bb->data);
> +	deep_clear_bb_data(&bb->data, clear_bb_commit);
>  	free(bb->commits);
>  	bb->commits_nr = bb->commits_alloc = 0;
>  }
> -- 
> 2.43.0.24.g980b318f98
> 


* Re: [PATCH 04/24] midx: factor out `fill_pack_info()`
  2023-11-28 19:08 ` [PATCH 04/24] midx: factor out `fill_pack_info()` Taylor Blau
@ 2023-11-30 10:18   ` Patrick Steinhardt
  2023-11-30 19:19     ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:08:05PM -0500, Taylor Blau wrote:
> When selecting which packfiles will be written while generating a MIDX,
> the MIDX internals fill out a 'struct pack_info' with various pieces of
> book-keeping.
> 
> Instead of filling out each field of the `pack_info` structure
> individually in each of the two spots that modify the array of such
> structures (`ctx->info`), extract a common routine that does this for
> us.
> 
> This reduces the code duplication by a modest amount. But more
> importantly, it zero-initializes the structure before assigning values
> into it. This hardens us for a future change which will add additional
> fields to this structure which (until this patch) was not
> zero-initialized.
> 
> As a result, any new fields added to the `pack_info` structure need only
> be updated in a single location, instead of at each spot within midx.c.
> 
> There are no functional changes in this patch.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  midx.c | 35 +++++++++++++++++++----------------
>  1 file changed, 19 insertions(+), 16 deletions(-)
> 
> diff --git a/midx.c b/midx.c
> index 3b727dc633..591b3c636e 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -464,6 +464,17 @@ struct pack_info {
>  	unsigned expired : 1;
>  };
>  
> +static void fill_pack_info(struct pack_info *info,
> +			   struct packed_git *p, char *pack_name,
> +			   uint32_t orig_pack_int_id)
> +{
> +	memset(info, 0, sizeof(struct pack_info));
> +
> +	info->orig_pack_int_id = orig_pack_int_id;
> +	info->pack_name = pack_name;
> +	info->p = p;
> +}

Nit: all callers manually call `xstrdup(pack_name)` and pass that to
`fill_pack_info()`. We could consider doing this in here instead so that
ownership of the string becomes a tad clearer.
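
A minimal sketch of that idea, assuming a hypothetical `example_info`
struct and a local `dup_string()` helper standing in for `xstrdup()`:
the fill function copies the name itself, so the struct always owns its
string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for the structure being filled. */
struct example_info {
	char *name;	/* owned by the struct */
	unsigned id;
};

/* Stand-in for xstrdup(): duplicate a string, aborting on failure. */
static char *dup_string(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	assert(copy);
	memcpy(copy, s, len);
	return copy;
}

/*
 * Zero the struct, then copy the name so that the struct owns its own
 * string regardless of what the caller does with its buffer afterwards.
 */
static void example_info_fill(struct example_info *info,
			      const char *name, unsigned id)
{
	assert(info && name);
	memset(info, 0, sizeof(*info));
	info->name = dup_string(name);
	info->id = id;
}

static void example_info_clear(struct example_info *info)
{
	free(info->name);
	info->name = NULL;
}
```

With this shape, callers pass a `const char *` and never need to
remember to duplicate it first.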

>  static int pack_info_compare(const void *_a, const void *_b)
>  {
>  	struct pack_info *a = (struct pack_info *)_a;
> @@ -504,6 +515,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>  			     const char *file_name, void *data)
>  {
>  	struct write_midx_context *ctx = data;
> +	struct packed_git *p;
>  
>  	if (ends_with(file_name, ".idx")) {
>  		display_progress(ctx->progress, ++ctx->pack_paths_checked);
> @@ -530,17 +542,14 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>  
>  		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
>  
> -		ctx->info[ctx->nr].p = add_packed_git(full_path,
> -						      full_path_len,
> -						      0);
> -
> -		if (!ctx->info[ctx->nr].p) {
> +		p = add_packed_git(full_path, full_path_len, 0);
> +		if (!p) {
>  			warning(_("failed to add packfile '%s'"),
>  				full_path);
>  			return;
>  		}
>  
> -		if (open_pack_index(ctx->info[ctx->nr].p)) {
> +		if (open_pack_index(p)) {
>  			warning(_("failed to open pack-index '%s'"),
>  				full_path);
>  			close_pack(ctx->info[ctx->nr].p);

Isn't `ctx->info[ctx->nr].p` still uninitialized at this point?
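
The concern can be sketched in isolation as follows. The types and the
`example_add_pack()` function are invented for illustration and only
mimic the shape of the code above: the error path cleans up through the
local pointer, because the array slot is not filled in until the end.

```c
#include <assert.h>
#include <stddef.h>

struct example_pack {
	int is_open;
};

struct example_slot {
	struct example_pack *p;
};

/*
 * Try to register "p" in "slot". "index_ok" simulates whether opening
 * the pack index succeeded.
 */
static int example_add_pack(struct example_slot *slot,
			    struct example_pack *p, int index_ok)
{
	assert(slot && p);
	if (!index_ok) {
		/*
		 * Clean up via the local pointer: "slot" has not been
		 * filled in yet, so slot->p must not be read here.
		 */
		p->is_open = 0;
		return -1;
	}
	slot->p = p;	/* only now does the slot become valid */
	return 0;
}
```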

> @@ -548,9 +557,8 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>  			return;
>  		}
>  
> -		ctx->info[ctx->nr].pack_name = xstrdup(file_name);
> -		ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
> -		ctx->info[ctx->nr].expired = 0;
> +		fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
> +			       ctx->nr);
>  		ctx->nr++;
>  	}
>  }
> @@ -1310,11 +1318,6 @@ static int write_midx_internal(const char *object_dir,
>  		for (i = 0; i < ctx.m->num_packs; i++) {
>  			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
>  
> -			ctx.info[ctx.nr].orig_pack_int_id = i;
> -			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
> -			ctx.info[ctx.nr].p = ctx.m->packs[i];
> -			ctx.info[ctx.nr].expired = 0;
> -
>  			if (flags & MIDX_WRITE_REV_INDEX) {
>  				/*
>  				 * If generating a reverse index, need to have
> @@ -1330,10 +1333,10 @@ static int write_midx_internal(const char *object_dir,
>  				if (open_pack_index(ctx.m->packs[i]))
>  					die(_("could not open index for %s"),
>  					    ctx.m->packs[i]->pack_name);
> -				ctx.info[ctx.nr].p = ctx.m->packs[i];

Just to make sure I'm not missing anything, but this assignment here was
basically redundant before this patch already, right?

Patrick

>  			}
>  
> -			ctx.nr++;
> +			fill_pack_info(&ctx.info[ctx.nr++], ctx.m->packs[i],
> +				       xstrdup(ctx.m->pack_names[i]), i);
>  		}
>  	}
>  
> -- 
> 2.43.0.24.g980b318f98
> 


* Re: [PATCH 05/24] midx: implement `DISP` chunk
  2023-11-28 19:08 ` [PATCH 05/24] midx: implement `DISP` chunk Taylor Blau
@ 2023-11-30 10:18   ` Patrick Steinhardt
  2023-11-30 19:27     ` Taylor Blau
  2023-12-03 13:15   ` Junio C Hamano
  1 sibling, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:08:08PM -0500, Taylor Blau wrote:
> When a multi-pack bitmap is used to implement verbatim pack reuse (that
> is, when verbatim chunks from an on-disk packfile are copied
> directly[^1]), it does so by using its "preferred pack" as the source
> for pack-reuse.
> 
> This allows repositories to pack the majority of their objects into a
> single (often large) pack, and then use it as the single source for
> verbatim pack reuse. This increases the amount of objects that are
> reused verbatim (and consequently, decrease the amount of time it takes
> to generate many packs). But this performance comes at a cost, which is
> that the preferred packfile must pace its growth with that of the entire
> repository in order to maintain the utility of verbatim pack reuse.
> 
> As repositories grow beyond what we can reasonably store in a single
> packfile, the utility of verbatim pack reuse diminishes. Or, at the very
> least, it becomes increasingly more expensive to maintain as the pack
> grows larger and larger.
> 
> It would be beneficial to be able to perform this same optimization over
> multiple packs, provided some modest constraints (most importantly, that
> the set of packs eligible for verbatim reuse are disjoint with respect
> to the objects that they contain).
> 
> If we assume that the packs which we treat as candidates for verbatim
> reuse are disjoint with respect to their objects, we need to make only
> modest modifications to the verbatim pack-reuse code itself. Most
> notably, we need to remove the assumption that the bits in the
> reachability bitmap corresponding to objects from the single reuse pack
> begin at the first bit position.
> 
> Future patches will unwind these assumptions and reimplement their
> existing functionality as special cases of the more general assumptions
> (e.g. that reuse bits can start anywhere within the bitset, but happen
> to start at 0 for all existing cases).
> 
> This patch does not yet relax any of those assumptions. Instead, it
> implements a foundational data-structure, the "Disjoint Packs" (`DISP`)
> chunk of the multi-pack index. The `DISP` chunk's contents are described
> in detail here. Importantly, the `DISP` chunk contains information to
> map regions of a multi-pack index's reachability bitmap to the packs
> whose objects they represent.
> 
> For now, this chunk is only written, not read (outside of the test-tool
> used in this patch to test the new chunk's behavior). Future patches
> will begin to make use of this new chunk.
> 
> This patch implements reading (though no callers outside of the above
> one perform any reading) and writing this new chunk. It also extends the
> `--stdin-packs` format used by the `git multi-pack-index write` builtin
> to be able to designate that a given pack is to be marked as "disjoint"
> by prefixing it with a '+' character.
> 
> [^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
>   pack that wasn't used verbatim.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/git-multi-pack-index.txt |   4 +
>  Documentation/gitformat-pack.txt       | 109 +++++++++++++++++++++++
>  builtin/multi-pack-index.c             |  10 ++-
>  midx.c                                 | 116 ++++++++++++++++++++++---
>  midx.h                                 |   5 ++
>  pack-bitmap.h                          |   9 ++
>  t/helper/test-read-midx.c              |  31 ++++++-
>  t/t5319-multi-pack-index.sh            |  58 +++++++++++++
>  8 files changed, 325 insertions(+), 17 deletions(-)
> 
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index 3696506eb3..d130e65b28 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -49,6 +49,10 @@ write::
>  	--stdin-packs::
>  		Write a multi-pack index containing only the set of
>  		line-delimited pack index basenames provided over stdin.
> +		Lines beginning with a '+' character (followed by the
> +		pack index basename as before) have their pack marked as
> +		"disjoint". See the "`DISP` chunk and disjoint packs"
> +		section in linkgit:gitformat-pack[5] for more.
>  
>  	--refs-snapshot=<path>::
>  		With `--bitmap`, optionally specify a file which
> diff --git a/Documentation/gitformat-pack.txt b/Documentation/gitformat-pack.txt
> index 9fcb29a9c8..658682ddd5 100644
> --- a/Documentation/gitformat-pack.txt
> +++ b/Documentation/gitformat-pack.txt
> @@ -396,6 +396,22 @@ CHUNK DATA:
>  	    is padded at the end with between 0 and 3 NUL bytes to make the
>  	    chunk size a multiple of 4 bytes.
>  
> +	Disjoint Packfiles (ID: {'D', 'I', 'S', 'P'})
> +	    Stores a table of three 4-byte unsigned integers in network order.
> +	    Each table entry corresponds to a single pack (in the order that
> +	    they appear above in the `PNAM` chunk). The values for each table
> +	    entry are as follows:
> +	    - The first bit position (in psuedo-pack order, see below) to

s/psuedo/pseudo/

> +	      contain an object from that pack.
> +	    - The number of bits whose objects are selected from that pack.
> +	    - A "meta" value, whose least-significant bit indicates whether or
> +	      not the pack is disjoint with respect to other packs. The
> +	      remaining bits are unused.
> +	    Two packs are "disjoint" with respect to one another when they have
> +	    disjoint sets of objects. In other words, any object found in a pack
> +	    contained in the set of disjoint packfiles is guaranteed to be
> +	    uniquely located among those packs.
> +
>  	OID Fanout (ID: {'O', 'I', 'D', 'F'})
>  	    The ith entry, F[i], stores the number of OIDs with first
>  	    byte at most i. Thus F[255] stores the total
> @@ -509,6 +525,99 @@ packs arranged in MIDX order (with the preferred pack coming first).
>  The MIDX's reverse index is stored in the optional 'RIDX' chunk within
>  the MIDX itself.
>  
> +=== `DISP` chunk and disjoint packs
> +
> +The Disjoint Packfiles (`DISP`) chunk encodes additional information
> +about the objects in the multi-pack index's reachability bitmap. Recall
> +that objects from the MIDX are arranged in "pseudo-pack" order (see:

The colon feels a bit out-of-place here, so: s/see:/see/

> +above) for reachability bitmaps.
> +
> +From the example above, suppose we have packs "a", "b", and "c", with
> +10, 15, and 20 objects, respectively. In pseudo-pack order, those would
> +be arranged as follows:
> +
> +    |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
> +
> +When working with single-pack bitmaps (or, equivalently, multi-pack
> +reachability bitmaps without any packs marked as disjoint),
> +linkgit:git-pack-objects[1] performs ``verbatim'' reuse, attempting to
> +reuse chunks of the existing packfile instead of adding objects to the
> +packing list.

I'm not sure I fully understand this paragraph. In the context of a
single pack bitmap it's clear enough. But I stumbled over the MIDX case,
because here we potentially have multiple packfiles, so it's not exactly
clear to me what you refer to with "the existing packfile" in that case.
I'd think that we perform verbatim reuse of the preferred packfile,
right? If so, we might want to make that a bit more explicit.

> +When a chunk of bytes are reused from an existing pack, any objects

s/are/is/, as it refers to the single chunk and not the plural bytes.

> +contained therein do not need to be added to the packing list, saving
> +memory and CPU time. But a chunk from an existing packfile can only be
> +reused when the following conditions are met:
> +
> +  - The chunk contains only objects which were requested by the caller
> +    (i.e. does not contain any objects which the caller didn't ask for
> +    explicitly or implicitly).
> +
> +  - All objects stored as offset- or reference-deltas also include their
> +    base object in the resulting pack.
> +
> +Additionally, packfiles many not contain more than one copy of any given

s/many/may/

> +object. This introduces an additional constraint over the set of packs
> +we may want to reuse. The most straightforward approach is to mandate
> +that the set of packs is disjoint with respect to the set of objects
> +contained in each pack. In other words, for each object `o` in the union
> +of all objects stored by the disjoint set of packs, `o` is contained in
> +exactly one pack from the disjoint set.

Is this a property that usually holds for our normal housekeeping, or
does it require careful managing by the user/admin? How about geometric
repacking?

> +One alternative design choice for multi-pack reuse might instead involve
> +imposing a chunk-level constraint that allows packs in the reusable set
> +to contain multiple copies across different packs, but restricts each
> +chunk against including more than one copy of such an object. This is in
> +theory possible to implement, but significantly more complicated than
> +forcing packs themselves to be disjoint. Most notably, we would have to
> +keep track of which objects have already been sent during verbatim
> +pack-reuse, defeating the main purpose of verbatim pack reuse (that we
> +don't have to keep track of individual objects).
> +
> +The `DISP` chunk encodes the necessary information in order to implement
> +multi-pack reuse over a disjoint set of packs as described above.
> +Specifically, the `DISP` chunk encodes three pieces of information (all
> +32-bit unsigned integers in network byte-order) for each packfile `p`
> +that is stored in the MIDX, as follows:
> +
> +`bitmap_pos`:: The first bit position (in pseudo-pack order) in the
> +  multi-pack index's reachability bitmap occupied by an object from `p`.
> +
> +`bitmap_nr`:: The number of bit positions (including the one at
> +  `bitmap_pos`) that encode objects from that pack `p`.
> +
> +`disjoint`:: Metadata, including whether or not the pack `p` is
> +  ``disjoint''. The least significant bit stores whether or not the pack
> +  is disjoint. The remaining bits are reserved for future use.
> +
> +For example, the `DISP` chunk corresponding to the above example (with
> +packs ``a'', ``b'', and ``c'') would look like:
> +
> +[cols="1,2,2,2"]
> +|===
> +| |`bitmap_pos` |`bitmap_nr` |`disjoint`
> +
> +|packfile ``a''
> +|`0`
> +|`10`
> +|`0x1`
> +
> +|packfile ``b''
> +|`10`
> +|`15`
> +|`0x1`
> +
> +|packfile ``c''
> +|`25`
> +|`20`
> +|`0x1`
> +|===
> +
> +With these constraints and information in place, we can treat each
> +packfile marked as disjoint as individually reusable in the same fashion
> +as verbatim pack reuse is performed on individual packs prior to the
> +implementation of the `DISP` chunk.
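
To check that I read the table correctly, here is a minimal sketch of how
the `bitmap_pos` and `bitmap_nr` columns fall out of the per-pack object
counts in pseudo-pack order. It is not code from this patch: `struct
disp_entry` and `disp_layout()` are names I made up for illustration, and
the sketch assumes that every pack is marked as disjoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one DISP table entry. */
struct disp_entry {
	uint32_t bitmap_pos; /* first bit occupied by an object of the pack */
	uint32_t bitmap_nr;  /* number of bits belonging to the pack */
	uint32_t meta;       /* least-significant bit: pack is disjoint */
};

/*
 * Fill out[i] for each of the `nr` packs, given the number of objects
 * selected from each pack, listed in pseudo-pack order.
 */
static void disp_layout(const uint32_t *nr_objects, size_t nr,
			struct disp_entry *out)
{
	uint32_t pos = 0;
	size_t i;

	assert(nr_objects && out);

	for (i = 0; i < nr; i++) {
		out[i].bitmap_pos = pos;
		out[i].bitmap_nr = nr_objects[i];
		out[i].meta = 0x1; /* assumption: all packs are disjoint */
		pos += nr_objects[i];
	}
}
```

Feeding it the counts 10, 15, and 20 from the example gives the starting
positions 0, 10, and 25 shown in the table.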
> +
>  == cruft packs
>  
>  The cruft packs feature offer an alternative to Git's traditional mechanism of
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index a72aebecaa..0f1dd4651d 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -106,11 +106,17 @@ static int git_multi_pack_index_write_config(const char *var, const char *value,
>  	return 0;
>  }
>  
> +#define DISJOINT ((void*)(uintptr_t)1)
> +
>  static void read_packs_from_stdin(struct string_list *to)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> -	while (strbuf_getline(&buf, stdin) != EOF)
> -		string_list_append(to, buf.buf);
> +	while (strbuf_getline(&buf, stdin) != EOF) {
> +		if (*buf.buf == '+')
> +			string_list_append(to, buf.buf + 1)->util = DISJOINT;
> +		else
> +			string_list_append(to, buf.buf);
> +	}
>  	string_list_sort(to);
>  
>  	strbuf_release(&buf);
> diff --git a/midx.c b/midx.c
> index 591b3c636e..f55020072f 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -33,6 +33,7 @@
>  
>  #define MIDX_CHUNK_ALIGNMENT 4
>  #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
> +#define MIDX_CHUNKID_DISJOINTPACKS 0x44495350 /* "DISP" */
>  #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
> @@ -182,6 +183,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
>  
>  	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
>  		   &m->chunk_large_offsets_len);
> +	pair_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
> +		   (const unsigned char **)&m->chunk_disjoint_packs,
> +		   &m->chunk_disjoint_packs_len);
>  
>  	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
>  		pair_chunk(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex,
> @@ -275,6 +279,23 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t
>  	return 0;
>  }
>  
> +int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
> +		       struct bitmapped_pack *bp, uint32_t pack_int_id)
> +{
> +	if (!m->chunk_disjoint_packs)
> +		return error(_("MIDX does not contain the DISP chunk"));
> +
> +	if (prepare_midx_pack(r, m, pack_int_id))
> +		return error(_("could not load disjoint pack %"PRIu32), pack_int_id);
> +
> +	bp->p = m->packs[pack_int_id];
> +	bp->bitmap_pos = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id);
> +	bp->bitmap_nr = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 1);
> +	bp->disjoint = !!get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 2);
> +
> +	return 0;
> +}
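
For reference, this is how I understand the on-disk layout from the format
documentation above: three 4-byte big-endian integers per pack, so entry
`pack_int_id` starts 12 bytes times `pack_int_id` into the chunk. A
standalone sketch, not the patch's code: `read_be32()`, `struct
disp_fields`, and `disp_decode()` are hypothetical names, and the sketch
masks only the least-significant bit of the "meta" word because the
documentation reserves the remaining bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form of one DISP entry. */
struct disp_fields {
	uint32_t bitmap_pos;
	uint32_t bitmap_nr;
	unsigned disjoint : 1;
};

/* Read a 32-bit unsigned integer stored in network byte order. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode the entry for `pack_int_id` from the raw chunk bytes. */
static void disp_decode(const unsigned char *chunk, uint32_t pack_int_id,
			struct disp_fields *out)
{
	const unsigned char *entry = chunk + (size_t)pack_int_id * 12;

	assert(chunk && out);

	out->bitmap_pos = read_be32(entry);
	out->bitmap_nr = read_be32(entry + 4);
	/* Only the least-significant bit carries the disjoint flag. */
	out->disjoint = read_be32(entry + 8) & 0x1;
}
```

A real reader would also have to check the chunk length against the number
of packs before indexing into it, which the sketch leaves out.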
> +
>  int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result)
>  {
>  	return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
> @@ -457,11 +478,18 @@ static size_t write_midx_header(struct hashfile *f,
>  	return MIDX_HEADER_SIZE;
>  }
>  
> +#define BITMAP_POS_UNKNOWN (~((uint32_t)0))
> +
>  struct pack_info {
>  	uint32_t orig_pack_int_id;
>  	char *pack_name;
>  	struct packed_git *p;
> -	unsigned expired : 1;
> +
> +	uint32_t bitmap_pos;
> +	uint32_t bitmap_nr;
> +
> +	unsigned expired : 1,
> +		 disjoint : 1;
>  };
>  
>  static void fill_pack_info(struct pack_info *info,
> @@ -473,6 +501,7 @@ static void fill_pack_info(struct pack_info *info,
>  	info->orig_pack_int_id = orig_pack_int_id;
>  	info->pack_name = pack_name;
>  	info->p = p;
> +	info->bitmap_pos = BITMAP_POS_UNKNOWN;
>  }
>  
>  static int pack_info_compare(const void *_a, const void *_b)
> @@ -516,6 +545,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>  {
>  	struct write_midx_context *ctx = data;
>  	struct packed_git *p;
> +	struct string_list_item *item = NULL;
>  
>  	if (ends_with(file_name, ".idx")) {
>  		display_progress(ctx->progress, ++ctx->pack_paths_checked);
> @@ -534,11 +564,13 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>  		 * should be performed independently (likely checking
>  		 * to_include before the existing MIDX).
>  		 */
> -		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> -			return;
> -		else if (ctx->to_include &&
> -			 !string_list_has_string(ctx->to_include, file_name))
> +		if (ctx->m && midx_contains_pack(ctx->m, file_name)) {
>  			return;
> +		} else if (ctx->to_include) {
> +			item = string_list_lookup(ctx->to_include, file_name);
> +			if (!item)
> +				return;
> +		}
>  
>  		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
>  
> @@ -559,6 +591,8 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>  
>  		fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
>  			       ctx->nr);
> +		if (item)
> +			ctx->info[ctx->nr].disjoint = !!item->util;
>  		ctx->nr++;
>  	}
>  }
> @@ -568,7 +602,8 @@ struct pack_midx_entry {
>  	uint32_t pack_int_id;
>  	time_t pack_mtime;
>  	uint64_t offset;
> -	unsigned preferred : 1;
> +	unsigned preferred : 1,
> +		 disjoint : 1;
>  };
>  
>  static int midx_oid_compare(const void *_a, const void *_b)
> @@ -586,6 +621,12 @@ static int midx_oid_compare(const void *_a, const void *_b)
>  	if (a->preferred < b->preferred)
>  		return 1;
>  
> +	/* Sort objects in a disjoint pack last when multiple copies exist. */
> +	if (a->disjoint < b->disjoint)
> +		return -1;
> +	if (a->disjoint > b->disjoint)
> +		return 1;
> +
>  	if (a->pack_mtime > b->pack_mtime)
>  		return -1;
>  	else if (a->pack_mtime < b->pack_mtime)
> @@ -671,6 +712,7 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
>  					   &fanout->entries[fanout->nr],
>  					   cur_object);
>  		fanout->entries[fanout->nr].preferred = 0;
> +		fanout->entries[fanout->nr].disjoint = 0;
>  		fanout->nr++;
>  	}
>  }
> @@ -696,6 +738,7 @@ static void midx_fanout_add_pack_fanout(struct midx_fanout *fanout,
>  				cur_object,
>  				&fanout->entries[fanout->nr],
>  				preferred);
> +		fanout->entries[fanout->nr].disjoint = info->disjoint;
>  		fanout->nr++;
>  	}
>  }
> @@ -764,14 +807,22 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
>  		 * Take only the first duplicate.
>  		 */
>  		for (cur_object = 0; cur_object < fanout.nr; cur_object++) {
> -			if (cur_object && oideq(&fanout.entries[cur_object - 1].oid,
> -						&fanout.entries[cur_object].oid))
> -				continue;
> +			struct pack_midx_entry *ours = &fanout.entries[cur_object];
> +			if (cur_object) {
> +				struct pack_midx_entry *prev = &fanout.entries[cur_object - 1];
> +				if (oideq(&prev->oid, &ours->oid)) {
> +					if (prev->disjoint && ours->disjoint)
> +						die(_("duplicate object '%s' among disjoint packs '%s', '%s'"),
> +						    oid_to_hex(&prev->oid),
> +						    info[prev->pack_int_id].pack_name,
> +						    info[ours->pack_int_id].pack_name);

Shouldn't we die if `prev->disjoint || ours->disjoint` instead of `&&`?
Even if one of the packs isn't marked as disjoint, it's still wrong if
the other one is and one of its objects exists in multiple packs.

Or am I misunderstanding, and we only guarantee the disjoint property
across packfiles that are actually marked as such?
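
To make the two readings concrete, here is a small standalone sketch (not
code from the patch). `struct entry` and `find_disjoint_conflict()` are
made-up names, object IDs are reduced to plain integers, and the input is
assumed to be sorted by object ID so that copies of the same object are
adjacent:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a MIDX entry. */
struct entry {
	uint32_t oid;          /* stand-in for the object ID */
	unsigned disjoint : 1; /* entry comes from a disjoint pack */
};

/*
 * Scan entries sorted by `oid` and return the index of the first
 * duplicate that violates the chosen policy, or -1 if there is none.
 *
 * strict == 0: a duplicate is an error only if both copies are in
 *              disjoint packs (the `&&` reading).
 * strict != 0: a duplicate is an error if either copy is in a
 *              disjoint pack (the `||` reading).
 */
static int find_disjoint_conflict(const struct entry *e, size_t nr,
				  int strict)
{
	size_t i;

	assert(e || !nr);

	for (i = 1; i < nr; i++) {
		const struct entry *prev = &e[i - 1], *ours = &e[i];

		if (prev->oid != ours->oid)
			continue;
		if (strict ? (prev->disjoint || ours->disjoint)
			   : (prev->disjoint && ours->disjoint))
			return (int)i;
	}
	return -1;
}
```

With one copy of an object in a disjoint pack and another copy in a
non-disjoint pack, the `&&` reading accepts the input while the `||`
reading rejects it.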

Patrick

> +					continue;
> +				}
> +			}
>  
>  			ALLOC_GROW(deduplicated_entries, st_add(*nr_objects, 1),
>  				   alloc_objects);
> -			memcpy(&deduplicated_entries[*nr_objects],
> -			       &fanout.entries[cur_object],
> +			memcpy(&deduplicated_entries[*nr_objects], ours,
>  			       sizeof(struct pack_midx_entry));
>  			(*nr_objects)++;
>  		}
> @@ -814,6 +865,27 @@ static int write_midx_pack_names(struct hashfile *f, void *data)
>  	return 0;
>  }
>  
> +static int write_midx_disjoint_packs(struct hashfile *f, void *data)
> +{
> +	struct write_midx_context *ctx = data;
> +	size_t i;
> +
> +	for (i = 0; i < ctx->nr; i++) {
> +		struct pack_info *pack = &ctx->info[i];
> +		if (pack->expired)
> +			continue;
> +
> +		if (pack->bitmap_pos == BITMAP_POS_UNKNOWN && pack->bitmap_nr)
> +			BUG("pack '%s' has no bitmap position, but has %d bitmapped object(s)",
> +			    pack->pack_name, pack->bitmap_nr);
> +
> +		hashwrite_be32(f, pack->bitmap_pos);
> +		hashwrite_be32(f, pack->bitmap_nr);
> +		hashwrite_be32(f, !!pack->disjoint);
> +	}
> +	return 0;
> +}
> +
>  static int write_midx_oid_fanout(struct hashfile *f,
>  				 void *data)
>  {
> @@ -981,8 +1053,19 @@ static uint32_t *midx_pack_order(struct write_midx_context *ctx)
>  	QSORT(data, ctx->entries_nr, midx_pack_order_cmp);
>  
>  	ALLOC_ARRAY(pack_order, ctx->entries_nr);
> -	for (i = 0; i < ctx->entries_nr; i++)
> +	for (i = 0; i < ctx->entries_nr; i++) {
> +		struct pack_midx_entry *e = &ctx->entries[data[i].nr];
> +		struct pack_info *pack = &ctx->info[ctx->pack_perm[e->pack_int_id]];
> +		if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
> +			pack->bitmap_pos = i;
> +		pack->bitmap_nr++;
>  		pack_order[i] = data[i].nr;
> +	}
> +	for (i = 0; i < ctx->nr; i++) {
> +		struct pack_info *pack = &ctx->info[ctx->pack_perm[i]];
> +		if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
> +			pack->bitmap_pos = 0;
> +	}
>  	free(data);
>  
>  	trace2_region_leave("midx", "midx_pack_order", the_repository);
> @@ -1283,6 +1366,7 @@ static int write_midx_internal(const char *object_dir,
>  	struct hashfile *f = NULL;
>  	struct lock_file lk;
>  	struct write_midx_context ctx = { 0 };
> +	int pack_disjoint_concat_len = 0;
>  	int pack_name_concat_len = 0;
>  	int dropped_packs = 0;
>  	int result = 0;
> @@ -1495,8 +1579,10 @@ static int write_midx_internal(const char *object_dir,
>  	}
>  
>  	for (i = 0; i < ctx.nr; i++) {
> -		if (!ctx.info[i].expired)
> -			pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
> +		if (ctx.info[i].expired)
> +			continue;
> +		pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
> +		pack_disjoint_concat_len += 3 * sizeof(uint32_t);
>  	}
>  
>  	/* Check that the preferred pack wasn't expired (if given). */
> @@ -1556,6 +1642,8 @@ static int write_midx_internal(const char *object_dir,
>  		add_chunk(cf, MIDX_CHUNKID_REVINDEX,
>  			  st_mult(ctx.entries_nr, sizeof(uint32_t)),
>  			  write_midx_revindex);
> +		add_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
> +			  pack_disjoint_concat_len, write_midx_disjoint_packs);
>  	}
>  
>  	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
> diff --git a/midx.h b/midx.h
> index a5d98919c8..cdd16a8378 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -7,6 +7,7 @@
>  struct object_id;
>  struct pack_entry;
>  struct repository;
> +struct bitmapped_pack;
>  
>  #define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
>  #define GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP \
> @@ -33,6 +34,8 @@ struct multi_pack_index {
>  
>  	const unsigned char *chunk_pack_names;
>  	size_t chunk_pack_names_len;
> +	const uint32_t *chunk_disjoint_packs;
> +	size_t chunk_disjoint_packs_len;
>  	const uint32_t *chunk_oid_fanout;
>  	const unsigned char *chunk_oid_lookup;
>  	const unsigned char *chunk_object_offsets;
> @@ -58,6 +61,8 @@ void get_midx_rev_filename(struct strbuf *out, struct multi_pack_index *m);
>  
>  struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
>  int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
> +int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
> +		       struct bitmapped_pack *bp, uint32_t pack_int_id);
>  int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
>  off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos);
>  uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos);
> diff --git a/pack-bitmap.h b/pack-bitmap.h
> index 5273a6a019..b7fa1a42a9 100644
> --- a/pack-bitmap.h
> +++ b/pack-bitmap.h
> @@ -52,6 +52,15 @@ typedef int (*show_reachable_fn)(
>  
>  struct bitmap_index;
>  
> +struct bitmapped_pack {
> +	struct packed_git *p;
> +
> +	uint32_t bitmap_pos;
> +	uint32_t bitmap_nr;
> +
> +	unsigned disjoint : 1;
> +};
> +
>  struct bitmap_index *prepare_bitmap_git(struct repository *r);
>  struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx);
>  void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
> diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
> index e9a444ddba..4b44995dca 100644
> --- a/t/helper/test-read-midx.c
> +++ b/t/helper/test-read-midx.c
> @@ -100,10 +100,37 @@ static int read_midx_preferred_pack(const char *object_dir)
>  	return 0;
>  }
>  
> +static int read_midx_bitmapped_packs(const char *object_dir)
> +{
> +	struct multi_pack_index *midx = NULL;
> +	struct bitmapped_pack pack;
> +	uint32_t i;
> +
> +	setup_git_directory();
> +
> +	midx = load_multi_pack_index(object_dir, 1);
> +	if (!midx)
> +		return 1;
> +
> +	for (i = 0; i < midx->num_packs; i++) {
> +		if (nth_bitmapped_pack(the_repository, midx, &pack, i) < 0)
> +			return 1;
> +
> +		printf("%s\n", pack_basename(pack.p));
> +		printf("  bitmap_pos: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_pos);
> +		printf("  bitmap_nr: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_nr);
> +		printf("  disjoint: %s\n", pack.disjoint & 0x1 ? "yes" : "no");
> +	}
> +
> +	close_midx(midx);
> +
> +	return 0;
> +}
> +
>  int cmd__read_midx(int argc, const char **argv)
>  {
>  	if (!(argc == 2 || argc == 3))
> -		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
> +		usage("read-midx [--show-objects|--checksum|--preferred-pack|--bitmap] <object-dir>");
>  
>  	if (!strcmp(argv[1], "--show-objects"))
>  		return read_midx_file(argv[2], 1);
> @@ -111,5 +138,7 @@ int cmd__read_midx(int argc, const char **argv)
>  		return read_midx_checksum(argv[2]);
>  	else if (!strcmp(argv[1], "--preferred-pack"))
>  		return read_midx_preferred_pack(argv[2]);
> +	else if (!strcmp(argv[1], "--bitmap"))
> +		return read_midx_bitmapped_packs(argv[2]);
>  	return read_midx_file(argv[1], 0);
>  }
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index c4c6060cee..fd24e0c952 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -1157,4 +1157,62 @@ test_expect_success 'reader notices too-small revindex chunk' '
>  	test_cmp expect.err err
>  '
>  
> +test_expect_success 'disjoint packs are stored via the DISP chunk' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		for i in 1 2 3 4 5
> +		do
> +			test_commit "$i" &&
> +			git repack -d || return 1
> +		done &&
> +
> +		find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename | sort >packs &&
> +
> +		git multi-pack-index write --stdin-packs <packs &&
> +		test_must_fail test-tool read-midx --bitmap $objdir 2>err &&
> +		cat >expect <<-\EOF &&
> +		error: MIDX does not contain the DISP chunk
> +		EOF
> +		test_cmp expect err &&
> +
> +		sed -e "s/^/+/g" packs >in &&
> +		git multi-pack-index write --stdin-packs --bitmap \
> +			--preferred-pack="$(head -n1 <packs)" <in &&
> +		test-tool read-midx --bitmap $objdir >actual &&
> +		for i in $(test_seq $(wc -l <packs))
> +		do
> +			sed -ne "${i}s/\.idx$/\.pack/p" packs &&
> +			echo "  bitmap_pos: $(( $(( $i - 1 )) * 3 ))" &&
> +			echo "  bitmap_nr: 3" &&
> +			echo "  disjoint: yes" || return 1
> +		done >expect &&
> +		test_cmp expect actual
> +	)
> +'
> +
> +test_expect_success 'non-disjoint packs are detected' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		test_commit base &&
> +		git repack -d &&
> +		test_commit other &&
> +		git repack -a &&
> +
> +		ls -la .git/objects/pack/ &&
> +
> +		find $objdir/pack -type f -name "*.idx" |
> +			sed -e "s/.*\/\(.*\)$/+\1/g" >in &&
> +
> +		test_must_fail git multi-pack-index write --stdin-packs \
> +			--bitmap <in 2>err &&
> +		grep "duplicate object.* among disjoint packs" err
> +	)
> +'
> +
>  test_done
> -- 
> 2.43.0.24.g980b318f98
> 


* Re: [PATCH 07/24] midx: implement `--retain-disjoint` mode
  2023-11-28 19:08 ` [PATCH 07/24] midx: implement `--retain-disjoint` mode Taylor Blau
@ 2023-11-30 10:18   ` Patrick Steinhardt
  2023-11-30 19:29     ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:08:13PM -0500, Taylor Blau wrote:
> Once multi-pack reachability bitmaps learn how to perform pack reuse
> over the set of disjoint packs, we will want to teach `git repack` to
> evolve the set of disjoint packs over time.
> 
> To evolve the set of disjoint packs means any new packs made by `repack`
> should be disjoint with respect to the existing set of disjoint packs so
> as to be able to join that set when updating the multi-pack index.
> 
> The details of generating such packs will be left to future commits. But
> any new pack(s) created by repack as disjoint will be marked as such by
> passing them over `--stdin-packs` with the special '+' marker when
> generating a new MIDX.
> 
> This patch, however, addresses the question of how we retain the
> existing set of disjoint packs when updating the multi-pack index. One
> option would be for `repack` to keep track of the set of disjoint packs
> itself by querying the MIDX, and then adding the special '+' marker
> appropriately when generating the input for `--stdin-packs`.
> 
> But this is verbose and error-prone, since two different parts of Git
> would need to maintain the same notion of the set of disjoint packs.
> When one disagrees with the other, the set of so-called disjoint packs
> may actually contain two or more packs which have one or more object(s)
> in common, making the set non-disjoint.
> 
> Instead, introduce a `--retain-disjoint` mode for the `git
> multi-pack-index write` sub-command which keeps any packs which are:
> 
>   - marked as disjoint in the existing MIDX, and
> 
>   - not deleted (e.g., they are not excluded from the input for
>     `--stdin-packs`).
> 
> This will allow the `repack` command to not have to keep track of the
> set of currently-disjoint packs itself, reducing the number of lines of
> code necessary to implement this feature, and making the resulting
> implementation less error-prone.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/git-multi-pack-index.txt |  8 +++
>  builtin/multi-pack-index.c             |  3 +
>  midx.c                                 | 49 +++++++++++++++
>  midx.h                                 |  1 +
>  t/lib-disjoint.sh                      | 38 ++++++++++++
>  t/t5319-multi-pack-index.sh            | 82 ++++++++++++++++++++++++++
>  6 files changed, 181 insertions(+)
>  create mode 100644 t/lib-disjoint.sh
> 
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index d130e65b28..ac0c7b124b 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -54,6 +54,14 @@ write::
>  		"disjoint". See the "`DISP` chunk and disjoint packs"
>  		section in linkgit:gitformat-pack[5] for more.
>  
> +	--retain-disjoint::
> +		When writing a multi-pack index with a reachability
> +		bitmap, keep any packs marked as disjoint in the
> +		existing MIDX (if any) as such in the new MIDX. Existing
> +		disjoint packs which are removed (e.g., not listed via
> +		`--stdin-packs`) are ignored. This option works in
> +		addition to the '+' marker for `--stdin-packs`.

I'm trying to understand when you're expected to pass this flag and when
you're expected not to pass it. Spelling that out in the documentation
here would also help the user make a more informed decision.

Patrick

>  	--refs-snapshot=<path>::
>  		With `--bitmap`, optionally specify a file which
>  		contains a "refs snapshot" taken prior to repacking.
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 0f1dd4651d..dcfabf2626 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -138,6 +138,9 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
>  			 N_("write multi-pack index containing only given indexes")),
>  		OPT_FILENAME(0, "refs-snapshot", &opts.refs_snapshot,
>  			     N_("refs snapshot for selecting bitmap commits")),
> +		OPT_BIT(0, "retain-disjoint", &opts.flags,
> +			N_("retain non-deleted disjoint packs"),
> +			MIDX_WRITE_RETAIN_DISJOINT),
>  		OPT_END(),
>  	};
>  
> diff --git a/midx.c b/midx.c
> index 65ba0c70fe..ce67da9f85 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -721,6 +721,12 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
>  					   &fanout->entries[fanout->nr],
>  					   cur_object);
>  		fanout->entries[fanout->nr].preferred = 0;
> +		/*
> +		 * It's OK to set disjoint to 0 here, even with
> +		 * `--retain-disjoint`, since we will always see the disjoint
> +		 * copy of some object below in get_sorted_entries(), causing us
> +		 * to die().
> +		 */
>  		fanout->entries[fanout->nr].disjoint = 0;
>  		fanout->nr++;
>  	}
> @@ -1362,6 +1368,37 @@ static struct multi_pack_index *lookup_multi_pack_index(struct repository *r,
>  	return result;
>  }
>  
> +static int midx_retain_existing_disjoint(struct repository *r,
> +					 struct multi_pack_index *from,
> +					 struct write_midx_context *ctx)
> +{
> +	struct bitmapped_pack bp;
> +	uint32_t i, midx_pos;
> +
> +	for (i = 0; i < ctx->nr; i++) {
> +		struct pack_info *info = &ctx->info[i];
> +		/*
> +		 * Having to call `midx_locate_pack()` in a loop is
> +		 * sub-optimal, since it is O(n*log(n)) in the number
> +		 * of packs.
> +		 *
> +		 * When reusing an existing MIDX, we know that the first
> +		 * 'n' packs appear in the same order, so we could avoid
> +		 * this when reusing an existing MIDX. But we may be
> +		 * instead relying on the order given to us by
> +		 * for_each_file_in_pack_dir(), in which case we can't
> +		 * make any such guarantees.
> +		 */
> +		if (!midx_locate_pack(from, info->pack_name, &midx_pos))
> +			continue;
> +
> +		if (nth_bitmapped_pack(r, from, &bp, midx_pos) < 0)
> +			return -1;
> +		info->disjoint = bp.disjoint;
> +	}
> +	return 0;
> +}
> +
>  static int write_midx_internal(const char *object_dir,
>  			       struct string_list *packs_to_include,
>  			       struct string_list *packs_to_drop,
> @@ -1444,6 +1481,18 @@ static int write_midx_internal(const char *object_dir,
>  	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
>  	stop_progress(&ctx.progress);
>  
> +	if (flags & MIDX_WRITE_RETAIN_DISJOINT) {
> +		struct multi_pack_index *m = ctx.m;
> +		if (!m)
> +			m = lookup_multi_pack_index(the_repository, object_dir);
> +
> +		if (m) {
> +			result = midx_retain_existing_disjoint(the_repository, m, &ctx);
> +			if (result)
> +				goto cleanup;
> +		}
> +	}
> +
>  	if ((ctx.m && ctx.nr == ctx.m->num_packs) &&
>  	    !(packs_to_include || packs_to_drop)) {
>  		struct bitmap_index *bitmap_git;
> diff --git a/midx.h b/midx.h
> index a6e969c2ea..d7ce52ff7b 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -54,6 +54,7 @@ struct multi_pack_index {
>  #define MIDX_WRITE_BITMAP (1 << 2)
>  #define MIDX_WRITE_BITMAP_HASH_CACHE (1 << 3)
>  #define MIDX_WRITE_BITMAP_LOOKUP_TABLE (1 << 4)
> +#define MIDX_WRITE_RETAIN_DISJOINT (1 << 5)
>  
>  const unsigned char *get_midx_checksum(struct multi_pack_index *m);
>  void get_midx_filename(struct strbuf *out, const char *object_dir);
> diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
> new file mode 100644
> index 0000000000..c6c6e74aba
> --- /dev/null
> +++ b/t/lib-disjoint.sh
> @@ -0,0 +1,38 @@
> +# Helpers for scripts testing disjoint packs; see t5319 for example usage.
> +
> +objdir=.git/objects
> +
> +test_disjoint_1 () {
> +	local pack="$1"
> +	local want="$2"
> +
> +	test-tool read-midx --bitmap $objdir >out &&
> +	grep -A 3 "$pack" out >found &&
> +
> +	if ! test -s found
> +	then
> +		echo >&2 "could not find '$pack' in MIDX"
> +		return 1
> +	fi
> +
> +	if ! grep -q "disjoint: $want" found
> +	then
> +		echo >&2 "incorrect disjoint state for pack '$pack'"
> +		return 1
> +	fi
> +	return 0
> +}
> +
> +# test_must_be_disjoint <pack-$XYZ.pack>
> +#
> +# Ensures that the given pack is marked as disjoint.
> +test_must_be_disjoint () {
> +	test_disjoint_1 "$1" "yes"
> +}
> +
> +# test_must_not_be_disjoint <pack-$XYZ.pack>
> +#
> +# Ensures that the given pack is not marked as disjoint.
> +test_must_not_be_disjoint () {
> +	test_disjoint_1 "$1" "no"
> +}
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index fd24e0c952..02cfddf151 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -3,6 +3,7 @@
>  test_description='multi-pack-indexes'
>  . ./test-lib.sh
>  . "$TEST_DIRECTORY"/lib-chunk.sh
> +. "$TEST_DIRECTORY"/lib-disjoint.sh
>  
>  GIT_TEST_MULTI_PACK_INDEX=0
>  objdir=.git/objects
> @@ -1215,4 +1216,85 @@ test_expect_success 'non-disjoint packs are detected' '
>  	)
>  '
>  
> +test_expect_success 'retain disjoint packs while writing' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		for i in 1 2
> +		do
> +			test_commit "$i" && git repack -d || return 1
> +		done &&
> +
> +		find $objdir/pack -type f -name "pack-*.idx" |
> +		sed -e "s/^.*\/\(.*\)/\1/g" | sort >packs.old &&
> +
> +		test_line_count = 2 packs.old &&
> +		disjoint="$(head -n 1 packs.old)" &&
> +		non_disjoint="$(tail -n 1 packs.old)" &&
> +
> +		cat >in <<-EOF &&
> +		+$disjoint
> +		$non_disjoint
> +		EOF
> +		git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> +		test_must_be_disjoint "${disjoint%.idx}.pack" &&
> +		test_must_not_be_disjoint "${non_disjoint%.idx}.pack" &&
> +
> +		test_commit 3 &&
> +		git repack -d &&
> +
> +		find $objdir/pack -type f -name "pack-*.idx" |
> +		sed -e "s/^.*\/\(.*\)/\1/g" | sort >packs.new &&
> +
> +		new_disjoint="$(comm -13 packs.old packs.new)" &&
> +		cat >in <<-EOF &&
> +		$disjoint
> +		$non_disjoint
> +		+$new_disjoint
> +		EOF
> +		git multi-pack-index write --stdin-packs --bitmap \
> +			--retain-disjoint <in &&
> +
> +		test_must_be_disjoint "${disjoint%.idx}.pack" &&
> +		test_must_be_disjoint "${new_disjoint%.idx}.pack" &&
> +		test_must_not_be_disjoint "${non_disjoint%.idx}.pack"
> +
> +	)
> +'
> +
> +test_expect_success 'non-disjoint packs are detected via --retain-disjoint' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +		packdir=.git/objects/pack &&
> +
> +		test_commit base &&
> +		base="$(echo base | git pack-objects --revs $packdir/pack)" &&
> +
> +		cat >in <<-EOF &&
> +		+pack-$base.idx
> +		EOF
> +		git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> +		test_must_be_disjoint "pack-$base.pack" &&
> +
> +		test_commit other &&
> +		other="$(echo other | git pack-objects --revs $packdir/pack)" &&
> +
> +		cat >in <<-EOF &&
> +		pack-$base.idx
> +		+pack-$other.idx
> +		EOF
> +		test_must_fail git multi-pack-index write --stdin-packs --retain-disjoint --bitmap <in 2>err &&
> +		grep "duplicate object.* among disjoint packs" err &&
> +
> +		test_must_fail git multi-pack-index write --retain-disjoint --bitmap 2>err &&
> +		grep "duplicate object.* among disjoint packs" err
> +	)
> +'
> +
>  test_done
> -- 
> 2.43.0.24.g980b318f98
> 

* Re: [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode
  2023-11-28 19:08 ` [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode Taylor Blau
@ 2023-11-30 10:18   ` Patrick Steinhardt
  2023-11-30 19:32     ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-11-30 10:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:08:16PM -0500, Taylor Blau wrote:
> Before multi-pack reachability bitmaps learn how to perform pack reuse
> over the set of disjoint packs, we will need a way to generate packs
> that are known to be disjoint with respect to the currently marked set
> of disjoint packs.
> 
> In other words, we want a way to make a pack which does not have any
> objects contained in the union of the set of packs which are currently
> marked as disjoint.
> 
> There are various ways that we could go about this, for example:
> 
>   - passing `--unpacked`, which would exclude all packed objects (and
>     thus would not contain any objects from the disjoint pack)
> 
>   - passing `--stdin-packs` with the set of packs currently marked as
>     disjoint as "excluded", indicating that `pack-objects` should
>     discard any objects present in any of the excluded packs (thus
>     producing a disjoint pack)
> 
>   - marking each of the disjoint packs as kept in-core with the
>     `--keep-pack` flag, and then passing `--honor-pack-keep` to
>     similarly ignore any object(s) from kept packs (thus also producing
>     a pack which is disjoint with respect to the current set)
> 
> `git repack` is the main entry-point to generating a new pack, by
> invoking `pack-objects` and then adding the new pack to the set of
> disjoint packs if generating a new MIDX. However, `repack` has a number
> of ways to invoke `pack-objects` (e.g., all-into-one repacks, geometric
> repacks, incremental repacks, etc.), all of which would require careful
> reasoning in order to prove that the resulting set of packs is disjoint.
> 
> The most appealing option of the above would be to pass the set of
> disjoint packs as kept (via `--keep-pack`) and then ignore their
> contents (with `--honor-pack-keep`), doing so for all kinds of
> `pack-objects` invocations. But there may be more disjoint packs than we
> can easily fit into the command-line arguments.
> 
> Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
> moral equivalent of marking the set of disjoint packs as kept, and
> ignoring their contents, even if it would have otherwise been packed. In
> fact, this similarity extends down to the implementation, where each
> disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.
> 
> With this in place, we can use the kept-pack cache from 20b031fede
> (packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
> which looks up objects first in a cache containing just the set of kept
> (in this case, disjoint) packs. Assuming that the set of disjoint packs
> is a relatively small portion of the entire repository (which should be
> a safe assumption to make), each object lookup will be very inexpensive.

This caught me by surprise a bit. I'd have expected that in the end,
most of the packfiles in a repository would be disjoint. Using for
example geometric repacks, my expectation was that all of the packs that
get written via geometric repacking would eventually become disjoint
whereas new packs added to the repository would initially not be.

Patrick
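
To make sure I read the mechanism in the commit message correctly, here
is a toy model of "mark every disjoint pack as kept in-core, then skip
any object found in a kept pack". It is a sketch under simplifying
assumptions: `struct toy_pack`, `mark_disjoint_as_kept()` and
`want_object()` are hypothetical names, object IDs are plain integers,
and the linear scan stands in for the kept-pack cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy pack: a list of integer "object IDs" plus flags. */
struct toy_pack {
	const int *objects;
	size_t nr;
	int disjoint;     /* as recorded in the MIDX */
	int keep_in_core; /* set by mark_disjoint_as_kept() */
};

/* Flag every disjoint pack as kept; return how many were flagged. */
size_t mark_disjoint_as_kept(struct toy_pack *packs, size_t nr_packs)
{
	size_t i, marked = 0;

	assert(packs || !nr_packs);

	for (i = 0; i < nr_packs; i++) {
		if (!packs[i].disjoint)
			continue;
		packs[i].keep_in_core = 1;
		marked++;
	}
	return marked;
}

/* An object is wanted unless some kept pack already contains it. */
int want_object(const struct toy_pack *packs, size_t nr_packs, int oid)
{
	size_t i, j;

	for (i = 0; i < nr_packs; i++) {
		if (!packs[i].keep_in_core)
			continue;
		for (j = 0; j < packs[i].nr; j++)
			if (packs[i].objects[j] == oid)
				return 0;
	}
	return 1;
}
```

In this model the cost of each lookup grows with the number of kept
packs, which is why the relative size of the disjoint set matters for
the performance claim above.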

> The only place we want to avoid using `--ignore-disjoint` is in
> conjunction with `--cruft`, since doing so may cause us to omit an
> object which would have been included in a new cruft pack in order to
> freshen it. In other words, failing to do so might cause that object to
> be pruned from the repository earlier than expected.
> 
> Otherwise, `--ignore-disjoint` is compatible with most other modes of
> `pack-objects`. These various combinations are tested below. As a
> result, `repack` will be able to unconditionally (except for the cruft
> pack) pass `--ignore-disjoint` when trying to add a new pack to the
> disjoint set, and the result will be usable, without having to carefully
> consider and reason about each individual case.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/git-pack-objects.txt |   8 ++
>  builtin/pack-objects.c             |  31 +++++-
>  t/lib-disjoint.sh                  |  11 ++
>  t/t5331-pack-objects-stdin.sh      | 156 +++++++++++++++++++++++++++++
>  4 files changed, 203 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
> index e32404c6aa..592c4ce742 100644
> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -96,6 +96,14 @@ base-name::
>  Incompatible with `--revs`, or options that imply `--revs` (such as
>  `--all`), with the exception of `--unpacked`, which is compatible.
>  
> +--ignore-disjoint::
> +	This flag causes an object that appears in any pack marked as
> +	"disjoint" by the multi-pack index to be ignored, even if it
> +	would have otherwise been packed. When used with
> +	`--stdin-packs`, objects from disjoint packs may be included if
> +	and only if a disjoint pack is explicitly given as an input pack
> +	to `--stdin-packs`. Incompatible with `--cruft`.
> +
>  --cruft::
>  	Packs unreachable objects into a separate "cruft" pack, denoted
>  	by the existence of a `.mtimes` file. Typically used by `git
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index bfa60359d4..107154db34 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -207,6 +207,7 @@ static int have_non_local_packs;
>  static int incremental;
>  static int ignore_packed_keep_on_disk;
>  static int ignore_packed_keep_in_core;
> +static int ignore_midx_disjoint_packs;
>  static int allow_ofs_delta;
>  static struct pack_idx_option pack_idx_opts;
>  static const char *base_name;
> @@ -1403,7 +1404,8 @@ static int want_found_object(const struct object_id *oid, int exclude,
>  	/*
>  	 * Then handle .keep first, as we have a fast(er) path there.
>  	 */
> -	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
> +	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core ||
> +	    ignore_midx_disjoint_packs) {
>  		/*
>  		 * Set the flags for the kept-pack cache to be the ones we want
>  		 * to ignore.
> @@ -1415,7 +1417,7 @@ static int want_found_object(const struct object_id *oid, int exclude,
>  		unsigned flags = 0;
>  		if (ignore_packed_keep_on_disk)
>  			flags |= ON_DISK_KEEP_PACKS;
> -		if (ignore_packed_keep_in_core)
> +		if (ignore_packed_keep_in_core || ignore_midx_disjoint_packs)
>  			flags |= IN_CORE_KEEP_PACKS;
>  
>  		if (ignore_packed_keep_on_disk && p->pack_keep)
> @@ -3389,6 +3391,7 @@ static void read_packs_list_from_stdin(void)
>  			die(_("could not find pack '%s'"), item->string);
>  		if (!is_pack_valid(p))
>  			die(_("packfile %s cannot be accessed"), p->pack_name);
> +		p->pack_keep_in_core = 0;
>  	}
>  
>  	/*
> @@ -4266,6 +4269,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  			 N_("create packs suitable for shallow fetches")),
>  		OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
>  			 N_("ignore packs that have companion .keep file")),
> +		OPT_BOOL(0, "ignore-disjoint", &ignore_midx_disjoint_packs,
> +			 N_("ignore packs that are marked disjoint in the MIDX")),
>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>  				N_("ignore this pack")),
>  		OPT_INTEGER(0, "compression", &pack_compression_level,
> @@ -4412,7 +4417,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  		if (use_internal_rev_list)
>  			die(_("cannot use internal rev list with --cruft"));
>  		if (stdin_packs)
> -			die(_("cannot use --stdin-packs with --cruft"));
> +			die(_("cannot use %s with %s"), "--stdin-packs", "--cruft");
> +		if (ignore_midx_disjoint_packs)
> +			die(_("cannot use %s with %s"), "--ignore-disjoint", "--cruft");
>  	}
>  
>  	/*
> @@ -4452,6 +4459,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  		if (!p) /* no keep-able packs found */
>  			ignore_packed_keep_on_disk = 0;
>  	}
> +	if (ignore_midx_disjoint_packs) {
> +		struct multi_pack_index *m = get_multi_pack_index(the_repository);
> +		struct bitmapped_pack pack;
> +		unsigned any_disjoint = 0;
> +		uint32_t i;
> +
> +		for (i = 0; m && m->chunk_disjoint_packs && i < m->num_packs; i++) {
> +			if (nth_bitmapped_pack(the_repository, m, &pack, i) < 0)
> +				die(_("could not load bitmapped pack %i"), i);
> +			if (pack.disjoint) {
> +				pack.p->pack_keep_in_core = 1;
> +				any_disjoint = 1;
> +			}
> +		}
> +
> +		if (!any_disjoint) /* no disjoint packs to ignore */
> +			ignore_midx_disjoint_packs = 0;
> +	}
>  	if (local) {
>  		/*
>  		 * unlike ignore_packed_keep_on_disk above, we do not
> diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
> index c6c6e74aba..c802ca6940 100644
> --- a/t/lib-disjoint.sh
> +++ b/t/lib-disjoint.sh
> @@ -36,3 +36,14 @@ test_must_be_disjoint () {
>  test_must_not_be_disjoint () {
>  	test_disjoint_1 "$1" "no"
>  }
> +
> +# packed_contents </path/to/pack-$XYZ.idx [...]>
> +#
> +# Prints the set of objects packed in the given pack indexes.
> +packed_contents () {
> +	for idx in "$@"
> +	do
> +		git show-index <$idx || return 1
> +	done >tmp &&
> +	cut -d" " -f2 <tmp | sort -u
> +}
> diff --git a/t/t5331-pack-objects-stdin.sh b/t/t5331-pack-objects-stdin.sh
> index 2dcf1eecee..e522aa3f7d 100755
> --- a/t/t5331-pack-objects-stdin.sh
> +++ b/t/t5331-pack-objects-stdin.sh
> @@ -6,6 +6,7 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>  
>  TEST_PASSES_SANITIZE_LEAK=true
>  . ./test-lib.sh
> +. "$TEST_DIRECTORY"/lib-disjoint.sh
>  
>  packed_objects () {
>  	git show-index <"$1" >tmp-object-list &&
> @@ -237,4 +238,159 @@ test_expect_success 'pack-objects --stdin with packfiles from main and alternate
>  	test_cmp expected-objects actual-objects
>  '
>  
> +objdir=.git/objects
> +packdir=$objdir/pack
> +
> +test_expect_success 'loose objects also in disjoint packs are ignored' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		# create a pack containing the objects in each commit below, but
> +		# do not delete their loose copies
> +		test_commit base &&
> +		base_pack="$(echo base | git pack-objects --revs $packdir/pack)" &&
> +
> +		test_commit other &&
> +		other_pack="$(echo base..other | git pack-objects --revs $packdir/pack)" &&
> +
> +		cat >in <<-EOF &&
> +		pack-$base_pack.idx
> +		+pack-$other_pack.idx
> +		EOF
> +		git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> +		test_commit more &&
> +		out="$(git pack-objects --all --ignore-disjoint $packdir/pack)" &&
> +
> +		# gather all objects in "all", and objects from the disjoint
> +		# pack in "disjoint"
> +		git cat-file --batch-all-objects --batch-check="%(objectname)" >all &&
> +		packed_contents "$packdir/pack-$other_pack.idx" >disjoint &&
> +
> +		# make sure that the set of objects we just generated matches
> +		# "all \ disjoint"
> +		packed_contents "$packdir/pack-$out.idx" >got &&
> +		comm -23 all disjoint >want &&
> +		test_cmp want got
> +	)
> +'
> +
> +test_expect_success 'objects in disjoint packs are ignored (--unpacked)' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		for c in A B
> +		do
> +			test_commit "$c" || return 1
> +		done &&
> +
> +		A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
> +		B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
> +
> +		cat >in <<-EOF &&
> +		pack-$A.idx
> +		+pack-$B.idx
> +		EOF
> +		git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> +		test_must_not_be_disjoint "pack-$A.pack" &&
> +		test_must_be_disjoint "pack-$B.pack" &&
> +
> +		test_commit C &&
> +
> +		got="$(git pack-objects --all --unpacked --ignore-disjoint $packdir/pack)" &&
> +		packed_contents "$packdir/pack-$got.idx" >actual &&
> +
> +		git rev-list --objects --no-object-names B..C >expect.raw &&
> +		sort <expect.raw >expect &&
> +
> +		test_cmp expect actual
> +	)
> +'
> +
> +test_expect_success 'objects in disjoint packs are ignored (--stdin-packs)' '
> +	# Create objects in three separate packs:
> +	#
> +	#   - pack A (midx, non disjoint)
> +	#   - pack B (midx, disjoint)
> +	#   - pack C (non-midx)
> +	#
> +	# Then create a new pack with `--stdin-packs` and `--ignore-disjoint`
> +	# including packs A, B, and C. The resulting pack should contain
> +	# only the objects from packs A, and C, excluding those from
> +	# pack B as it is marked as disjoint.
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		for c in A B C
> +		do
> +			test_commit "$c" || return 1
> +		done &&
> +
> +		A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
> +		B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
> +		C="$(echo "B..C" | git pack-objects --revs $packdir/pack)" &&
> +
> +		cat >in <<-EOF &&
> +		pack-$A.idx
> +		+pack-$B.idx
> +		EOF
> +		git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> +		test_must_not_be_disjoint "pack-$A.pack" &&
> +		test_must_be_disjoint "pack-$B.pack" &&
> +
> +		# Generate a pack with `--stdin-packs` using packs "A" and "C",
> +		# but excluding objects from "B". The objects from pack "B" are
> +		# expected to be omitted from the generated pack for two
> +		# reasons:
> +		#
> +		#   - because it was specified as a negated tip via
> +		#     `--stdin-packs`
> +		#   - because it is a disjoint pack.
> +		cat >in <<-EOF &&
> +		pack-$A.pack
> +		^pack-$B.pack
> +		pack-$C.pack
> +		EOF
> +		got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
> +
> +		packed_contents "$packdir/pack-$got.idx" >actual &&
> +		packed_contents "$packdir/pack-$A.idx" \
> +				"$packdir/pack-$C.idx" >expect &&
> +		test_cmp expect actual &&
> +
> +		# Generate another pack with `--stdin-packs`, this time
> +		# using packs "B" and "C". The objects from pack "B" are
> +		# expected to be in the final pack, despite it being a
> +		# disjoint pack, because "B" was mentioned explicitly
> +		# via `stdin-packs`.
> +		cat >in <<-EOF &&
> +		pack-$B.pack
> +		pack-$C.pack
> +		EOF
> +		got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
> +
> +		packed_contents "$packdir/pack-$got.idx" >actual &&
> +		packed_contents "$packdir/pack-$B.idx" \
> +				"$packdir/pack-$C.idx" >expect &&
> +		test_cmp expect actual
> +	)
> +'
> +
> +test_expect_success '--cruft is incompatible with --ignore-disjoint' '
> +	test_must_fail git pack-objects --cruft --ignore-disjoint --stdout \
> +		</dev/null >/dev/null 2>actual &&
> +	cat >expect <<-\EOF &&
> +	fatal: cannot use --ignore-disjoint with --cruft
> +	EOF
> +	test_cmp expect actual
> +'
> +
>  test_done
> -- 
> 2.43.0.24.g980b318f98
> 

* Re: [PATCH 01/24] pack-objects: free packing_data in more places
  2023-11-30 10:18   ` Patrick Steinhardt
@ 2023-11-30 19:08     ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-30 19:08 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Nov 30, 2023 at 11:18:26AM +0100, Patrick Steinhardt wrote:
> > diff --git a/pack-objects.c b/pack-objects.c
> > index f403ca6986..1c7bedcc94 100644
> > --- a/pack-objects.c
> > +++ b/pack-objects.c
> > @@ -151,6 +151,21 @@ void prepare_packing_data(struct repository *r, struct packing_data *pdata)
> >  	init_recursive_mutex(&pdata->odb_lock);
> >  }
> >
> > +void free_packing_data(struct packing_data *pdata)
>
> Nit: shouldn't this rather be called `clear_packing_data`? `free` to me
> indicates that the data structure itself will be free'd, as well, which
> is not the case.

Thanks, that's a good suggestion. I've made the change locally and will
include it in the subsequent round(s) ;-).

Thanks,
Taylor

* Re: [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab
  2023-11-30 10:18   ` Patrick Steinhardt
@ 2023-11-30 19:11     ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-30 19:11 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Nov 30, 2023 at 11:18:31AM +0100, Patrick Steinhardt wrote:
> > diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
> > index f4ecdf8b0e..dd3a415b9d 100644
> > --- a/pack-bitmap-write.c
> > +++ b/pack-bitmap-write.c
> > @@ -198,6 +198,13 @@ struct bb_commit {
> >  	unsigned idx; /* within selected array */
> >  };
> >
> > +static void clear_bb_commit(struct bb_commit *commit)
> > +{
> > +	free(commit->reverse_edges);
>
> I'd have expected to see `free_commit_list()` here instead of a simple
> free. Is there any reason why we don't use it?

Thanks for spotting an oversight on my part. We should definitely be
using free_commit_list() here instead of a bare free() to avoid leaking
the tail.

Thanks,
Taylor

* Re: [PATCH 04/24] midx: factor out `fill_pack_info()`
  2023-11-30 10:18   ` Patrick Steinhardt
@ 2023-11-30 19:19     ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-30 19:19 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Nov 30, 2023 at 11:18:37AM +0100, Patrick Steinhardt wrote:
> On Tue, Nov 28, 2023 at 02:08:05PM -0500, Taylor Blau wrote:
> > When selecting which packfiles will be written while generating a MIDX,
> > the MIDX internals fill out a 'struct pack_info' with various pieces of
> > book-keeping.
> >
> > Instead of filling out each field of the `pack_info` structure
> > individually in each of the two spots that modify the array of such
> > structures (`ctx->info`), extract a common routine that does this for
> > us.
> >
> > This reduces the code duplication by a modest amount. But more
> > importantly, it zero-initializes the structure before assigning values
> > into it. This hardens us for a future change which will add additional
> > fields to this structure which (until this patch) was not
> > zero-initialized.
> >
> > As a result, any new fields added to the `pack_info` structure need only
> > be updated in a single location, instead of at each spot within midx.c.
> >
> > There are no functional changes in this patch.
> >
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > ---
> >  midx.c | 35 +++++++++++++++++++----------------
> >  1 file changed, 19 insertions(+), 16 deletions(-)
> >
> > diff --git a/midx.c b/midx.c
> > index 3b727dc633..591b3c636e 100644
> > --- a/midx.c
> > +++ b/midx.c
> > @@ -464,6 +464,17 @@ struct pack_info {
> >  	unsigned expired : 1;
> >  };
> >
> > +static void fill_pack_info(struct pack_info *info,
> > +			   struct packed_git *p, char *pack_name,
> > +			   uint32_t orig_pack_int_id)
> > +{
> > +	memset(info, 0, sizeof(struct pack_info));
> > +
> > +	info->orig_pack_int_id = orig_pack_int_id;
> > +	info->pack_name = pack_name;
> > +	info->p = p;
> > +}
>
> Nit: all callers manually call `xstrdup(pack_name)` and pass that to
> `fill_pack_info()`. We could consider doing this in here instead so that
> ownership of the string becomes a tad clearer.

That's a great idea. I think we'd also want to mark the pack_name
argument as const, not just because xstrdup() takes a `const char *`,
but also because it communicates the ownership more clearly.

I'll squash something like this in:

--- >8 ---
diff --git a/midx.c b/midx.c
index b8b3f41024..6fb5e237b7 100644
--- a/midx.c
+++ b/midx.c
@@ -465,13 +465,13 @@ struct pack_info {
 };

 static void fill_pack_info(struct pack_info *info,
-			   struct packed_git *p, char *pack_name,
+			   struct packed_git *p, const char *pack_name,
 			   uint32_t orig_pack_int_id)
 {
 	memset(info, 0, sizeof(struct pack_info));

 	info->orig_pack_int_id = orig_pack_int_id;
-	info->pack_name = pack_name;
+	info->pack_name = xstrdup(pack_name);
 	info->p = p;
 }

@@ -557,8 +557,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			return;
 		}

-		fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
-			       ctx->nr);
+		fill_pack_info(&ctx->info[ctx->nr], p, file_name, ctx->nr);
 		ctx->nr++;
 	}
 }
@@ -1336,7 +1335,7 @@ static int write_midx_internal(const char *object_dir,
 			}

 			fill_pack_info(&ctx.info[ctx.nr++], ctx.m->packs[i],
-				       xstrdup(ctx.m->pack_names[i]), i);
+				       ctx.m->pack_names[i], i);
 		}
 	}
--- 8< ---

> > -		if (open_pack_index(ctx->info[ctx->nr].p)) {
> > +		if (open_pack_index(p)) {
> >  			warning(_("failed to open pack-index '%s'"),
> >  				full_path);
> >  			close_pack(ctx->info[ctx->nr].p);
>
> Isn't `ctx->info[ctx->nr].p` still uninitialized at this point?

Great catch, thank you!

> > @@ -1330,10 +1333,10 @@ static int write_midx_internal(const char *object_dir,
> >  				if (open_pack_index(ctx.m->packs[i]))
> >  					die(_("could not open index for %s"),
> >  					    ctx.m->packs[i]->pack_name);
> > -				ctx.info[ctx.nr].p = ctx.m->packs[i];
>
> Just to make sure I'm not missing anything, but this assignment here was
> basically redundant before this patch already, right?

I think that's right, but in either case we're assigning the pack once
at the end of each loop iteration via a single call to fill_pack_info().
Since we're using ctx.m->packs[i] in both places (after a call to
prepare_midx_pack()), we should be OK here.

Thanks,
Taylor

* Re: [PATCH 05/24] midx: implement `DISP` chunk
  2023-11-30 10:18   ` Patrick Steinhardt
@ 2023-11-30 19:27     ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-11-30 19:27 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Nov 30, 2023 at 11:18:45AM +0100, Patrick Steinhardt wrote:
> > diff --git a/Documentation/gitformat-pack.txt b/Documentation/gitformat-pack.txt
> > index 9fcb29a9c8..658682ddd5 100644
> > --- a/Documentation/gitformat-pack.txt
> > +++ b/Documentation/gitformat-pack.txt
> > @@ -396,6 +396,22 @@ CHUNK DATA:
> >  	    is padded at the end with between 0 and 3 NUL bytes to make the
> >  	    chunk size a multiple of 4 bytes.
> >
> > +	Disjoint Packfiles (ID: {'D', 'I', 'S', 'P'})
> > +	    Stores a table of three 4-byte unsigned integers in network order.
> > +	    Each table entry corresponds to a single pack (in the order that
> > +	    they appear above in the `PNAM` chunk). The values for each table
> > +	    entry are as follows:
> > +	    - The first bit position (in psuedo-pack order, see below) to
>
> s/psuedo/pseudo/

Good catch, thanks. Not sure how that escaped my spell-checker...

> > +=== `DISP` chunk and disjoint packs
> > +
> > +The Disjoint Packfiles (`DISP`) chunk encodes additional information
> > +about the objects in the multi-pack index's reachability bitmap. Recall
> > +that objects from the MIDX are arranged in "pseudo-pack" order (see:
>
> The colon feels a bit out-of-place here, so: s/see:/see/

Thanks, I'll fix that up.

> > +above) for reachability bitmaps.
> > +
> > +From the example above, suppose we have packs "a", "b", and "c", with
> > +10, 15, and 20 objects, respectively. In pseudo-pack order, those would
> > +be arranged as follows:
> > +
> > +    |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
> > +
> > +When working with single-pack bitmaps (or, equivalently, multi-pack
> > +reachability bitmaps without any packs marked as disjoint),
> > +linkgit:git-pack-objects[1] performs ``verbatim'' reuse, attempting to
> > +reuse chunks of the existing packfile instead of adding objects to the
> > +packing list.
>
> I'm not sure I fully understand this paragraph. In the context of a
> single pack bitmap it's clear enough. But I stumbled over the MIDX case,
> because here we potentially have multiple packfiles, so it's not exactly
> clear to me what you refer to with "the existing packfile" in that case.
> I'd think that we perform verbatim reuse of the preferred packfile,
> right? If so, we might want to make that a bit more explicit.

Yep, sorry, I can see how that would be confusing. Since we're talking
about the existing behavior at this point in the series (before
multi-pack reuse is implemented), I changed this to:

  "reuse chunks of the bitmapped or preferred packfile [...]"

Thanks for carefully reading and spotting my errors ;-).
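
As an aside, the "a", "b", "c" example quoted above can be worked
through with a small sketch that fills in one table entry per pack.
This is illustrative only: `struct disp_entry` and
`fill_disp_entries()` are hypothetical names, the field names are my
own, and the sketch assumes each pack contributes all of its objects to
the bitmap (no duplicates across packs) and ignores the on-disk network
byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory mirror of one three-integer table entry. */
struct disp_entry {
	uint32_t bitmap_pos; /* first bit position in pseudo-pack order */
	uint32_t bitmap_nr;  /* number of objects taken from this pack */
	uint32_t disjoint;   /* non-zero if the pack is marked disjoint */
};

/*
 * Lay the packs out back to back in pseudo-pack order: each pack
 * starts where the previous one ended.
 */
void fill_disp_entries(struct disp_entry *out, const uint32_t *nr_objects,
		       const uint32_t *disjoint, size_t nr_packs)
{
	uint32_t pos = 0;
	size_t i;

	assert(out || !nr_packs);

	for (i = 0; i < nr_packs; i++) {
		out[i].bitmap_pos = pos;
		out[i].bitmap_nr = nr_objects[i];
		out[i].disjoint = disjoint[i];
		pos += nr_objects[i];
	}
}
```

For packs of 10, 15, and 20 objects this gives starting bit positions
0, 10, and 25, matching the `|a,0|...|a,9|b,0|...|b,14|c,0|...|c,19|`
layout.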

> > +object. This introduces an additional constraint over the set of packs
> > +we may want to reuse. The most straightforward approach is to mandate
> > +that the set of packs is disjoint with respect to the set of objects
> > +contained in each pack. In other words, for each object `o` in the union
> > +of all objects stored by the disjoint set of packs, `o` is contained in
> > +exactly one pack from the disjoint set.
>
> Is this a property that usually holds for our normal housekeeping, or
> does it require careful managing by the user/admin? How about geometric
> repacking?

At this point in the series, it would require careful managing to ensure
that this is the case. In practice MIDX'd packs generated with a
geometric repack are mostly disjoint, but definitely not guaranteed to
be.

Further down in this series we'll introduce new options to generate
packs which are guaranteed to be disjoint with respect to the
currently-marked set of packs in the DISP chunk.

> > @@ -764,14 +807,22 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
> >  		 * Take only the first duplicate.
> >  		 */
> >  		for (cur_object = 0; cur_object < fanout.nr; cur_object++) {
> > -			if (cur_object && oideq(&fanout.entries[cur_object - 1].oid,
> > -						&fanout.entries[cur_object].oid))
> > -				continue;
> > +			struct pack_midx_entry *ours = &fanout.entries[cur_object];
> > +			if (cur_object) {
> > +				struct pack_midx_entry *prev = &fanout.entries[cur_object - 1];
> > +				if (oideq(&prev->oid, &ours->oid)) {
> > +					if (prev->disjoint && ours->disjoint)
> > +						die(_("duplicate object '%s' among disjoint packs '%s', '%s'"),
> > +						    oid_to_hex(&prev->oid),
> > +						    info[prev->pack_int_id].pack_name,
> > +						    info[ours->pack_int_id].pack_name);
>
> Shouldn't we die if `prev->disjoint || ours->disjoint` instead of `&&`?
> Even if one of the packs isn't marked as disjoint, it's still wrong if
> the other one is and one of its objects exists in multiple packs.
>
> Or am I misunderstanding, and we only guarantee the disjoint property
> across packfiles that are actually marked as such?

Right, we only guarantee disjointed-ness among the set of packs that are
marked disjoint. It's fine for the same object to appear in a disjoint
and non-disjoint pack, and for both of those packs to end up in the
MIDX. But that is only because we'll use the disjoint copy in our
bitmap.

If there were two packs that are marked as supposedly disjoint, but
contain at least one duplicate of an object, then we will reject those
packs as non-disjoint.
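
As a rough sketch of that rule (not the code in the patch): a duplicate
is only an error when more than one copy lives in a pack marked
disjoint. The `struct entry` type and `find_disjoint_violation()` below
are hypothetical simplifications. The quoted hunk compares neighbouring
entries, whereas this sketch tracks a whole run of equal object IDs,
which is a choice made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for one entry of the sorted MIDX object list. */
struct entry {
	const char *oid;	/* object ID as a hex string, for illustration */
	int disjoint;		/* non-zero if the containing pack is marked disjoint */
};

/*
 * Return the index of the first entry that is a second disjoint copy of
 * some object, or -1 if there is none. `entries` must be sorted by oid
 * so that copies of the same object are adjacent.
 */
long find_disjoint_violation(const struct entry *entries, size_t nr)
{
	int seen_disjoint = 0;
	size_t i;

	assert(entries || !nr);

	for (i = 0; i < nr; i++) {
		/* A different object ID starts a new run of duplicates. */
		if (i && strcmp(entries[i - 1].oid, entries[i].oid))
			seen_disjoint = 0;
		if (entries[i].disjoint) {
			if (seen_disjoint)
				return (long)i;
			seen_disjoint = 1;
		}
	}
	return -1;
}
```

A disjoint copy next to a non-disjoint copy passes; two disjoint copies
of the same object do not.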

Thanks,
Taylor

* Re: [PATCH 07/24] midx: implement `--retain-disjoint` mode
  2023-11-30 10:18   ` Patrick Steinhardt
@ 2023-11-30 19:29     ` Taylor Blau
  2023-12-01  8:02       ` Patrick Steinhardt
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-30 19:29 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Nov 30, 2023 at 11:18:51AM +0100, Patrick Steinhardt wrote:
> > diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> > index d130e65b28..ac0c7b124b 100644
> > --- a/Documentation/git-multi-pack-index.txt
> > +++ b/Documentation/git-multi-pack-index.txt
> > @@ -54,6 +54,14 @@ write::
> >  		"disjoint". See the "`DISP` chunk and disjoint packs"
> >  		section in linkgit:gitformat-pack[5] for more.
> >
> > +	--retain-disjoint::
> > +		When writing a multi-pack index with a reachability
> > +		bitmap, keep any packs marked as disjoint in the
> > +		existing MIDX (if any) as such in the new MIDX. Existing
> > +		disjoint packs which are removed (e.g., not listed via
> > +		`--stdin-packs`) are ignored. This option works in
> > +		addition to the '+' marker for `--stdin-packs`.
>
> I'm trying to understand when you're expected to pass this flag and when
> you're expected not to pass it. An explanation of this could also go
> into the documentation here so that the user can make a more informed
> decision.

I think there are multiple reasons that you may or may not want to pass
that flag. Certainly if you're not using disjoint packs (and instead
only care about single-pack verbatim reuse over the MIDX's preferred
packfile), then you don't need to pass it.

But if you are using disjoint packs, you may want to pass it if you are
adding packs to the MIDX which are disjoint, _and_ you want to hold onto
the existing set of disjoint packs.

But if you want to change the set of disjoint packs entirely, you would
want to omit this flag (unless you knew a-priori that you were going to
drop all of the currently marked disjoint packs from the new MIDX you
are writing, e.g. with --stdin-packs).

If you think it would be useful, I could try and distill some of this
down, but I think that there is likely too much detail here for it to be
useful in user-facing documentation.
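
For what it's worth, the per-pack outcome described by the option text
can be sketched as a small predicate. This is only my reading of the
documentation hunk quoted above, not the implementation; the struct and
function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical description of one pack while writing a new MIDX. */
struct pack_state {
	int listed;		/* pack is included in the new MIDX */
	int plus_marked;	/* pack was given with a leading '+' */
	int was_disjoint;	/* pack is marked disjoint in the existing MIDX */
};

/* Return non-zero if the pack ends up marked disjoint in the new MIDX. */
int disjoint_in_new_midx(const struct pack_state *p, int retain_disjoint)
{
	assert(p);

	if (!p->listed)
		return 0;	/* removed packs are ignored */
	if (p->plus_marked)
		return 1;	/* explicitly marked via '+' */
	return retain_disjoint && p->was_disjoint;
}
```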

Thanks,
Taylor

* Re: [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode
  2023-11-30 10:18   ` Patrick Steinhardt
@ 2023-11-30 19:32     ` Taylor Blau
  2023-12-01  8:17       ` Patrick Steinhardt
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-30 19:32 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Nov 30, 2023 at 11:18:57AM +0100, Patrick Steinhardt wrote:
> > Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
> > moral equivalent of marking the set of disjoint packs as kept, and
> > ignoring their contents, even if it would have otherwise been packed. In
> > fact, this similarity extends down to the implementation, where each
> > disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.
> >
> > With this in place, we can use the kept-pack cache from 20b031fede
> > (packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
> > which looks up objects first in a cache containing just the set of kept
> > (in this case, disjoint) packs. Assuming that the set of disjoint packs
> > is a relatively small portion of the entire repository (which should be
> > a safe assumption to make), each object lookup will be very inexpensive.
>
> This caught me by surprise a bit. I'd have expected that in the end,
> most of the packfiles in a repository would be disjoint. Using for
> example geometric repacks, my expectation was that all of the packs that
> get written via geometric repacking would eventually become disjoint
> whereas new packs added to the repository would initially not be.

Which part are you referring to here? If you're referring to the part
where I say that the set of disjoint packs is relatively small in
proportion to the rest of the packs, I think I know where the confusion
is.

I'm not saying that the set of disjoint packs is small in comparison to
the rest of the repository by object count, but rather by count of packs
overall. You're right that packs from pushes will not be guaranteed to
be disjoint upon entering the repository, but will become disjoint when
geometrically repacked (assuming that the caller uses --ignore-disjoint
when repacking).

Thanks,
Taylor

* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
  2023-11-30 10:18 ` [PATCH 00/24] pack-objects: multi-pack verbatim reuse Patrick Steinhardt
@ 2023-11-30 19:39   ` Taylor Blau
  2023-12-01  8:31     ` Patrick Steinhardt
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-11-30 19:39 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Nov 30, 2023 at 11:18:19AM +0100, Patrick Steinhardt wrote:
> > Performing verbatim pack reuse naturally trades off between CPU time and
> > the resulting pack size. In the above example, the single-pack reuse
> > case produces a clone size of ~194 MB on my machine, while the
> > multi-pack reuse case produces a clone size closer to ~266 MB, which is
> > a ~37% increase in clone size.
>
> Quite exciting, and a tradeoff that may be worth it for Git hosters. I
> expect that this is going to be an extreme example of the benefits
> provided by your patch series -- do you by any chance also have "real"
> numbers that make it possible to quantify the effect a bit better?
>
> No worry if you don't, I'm just curious.

I don't have a great sense, no. I haven't run these patches in
production yet, although I would like to do so soon for internal
repositories to get a better sense here.

There are some performance tests at the end which try and give you a
sense of at least the relative speed-up depending on how many disjoint
packs you have (IIRC, we test for 1, 10, and 100 disjoint packs).

> > I think there is still some opportunity to close this gap, since the
> > "packing" strategy here is extremely naive. In a production setting, I'm
> > sure that there are more well thought out repacking strategies that
> > would produce more similar clone sizes.
> >
> > I considered breaking this series up into smaller chunks, but was
> > unsatisfied with the result. Since this series is rather large, if you
> > have alternate suggestions on better ways to structure this, please let
> > me know.
>
> The series is indeed very involved to review. I only made it up to patch
> 8/24 and already spent quite some time on it. So I'd certainly welcome
> it if this was split up into smaller parts, but don't have a suggestion
> as to how this should be done (also because I didn't yet read the other
> 16 patches).

I suppose that one way to break it up might be:

    pack-objects: free packing_data in more places
    pack-bitmap-write: deep-clear the `bb_commit` slab
    pack-bitmap: plug leak in find_objects()

    midx: factor out `fill_pack_info()`
    midx: implement `DISP` chunk
    midx: implement `midx_locate_pack()`
    midx: implement `--retain-disjoint` mode

    pack-objects: implement `--ignore-disjoint` mode
    repack: implement `--extend-disjoint` mode

    pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
    pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
    pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
    pack-objects: parameterize pack-reuse routines over a single pack
    pack-objects: keep track of `pack_start` for each reuse pack
    pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
    pack-objects: prepare `write_reused_pack()` for multi-pack reuse
    pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse
    pack-objects: include number of packs reused in output
    pack-bitmap: prepare to mark objects from multiple packs for reuse

    pack-objects: add tracing for various packfile metrics
    t/test-lib-functions.sh: implement `test_trace2_data` helper
    pack-objects: allow setting `pack.allowPackReuse` to "single"
    pack-bitmap: reuse objects from all disjoint packs
    t/perf: add performance tests for multi-pack reuse

Then you'd have five patch series, where each series does roughly the
following:

  1. Preparatory clean-up.
  2. Implementing the DISP chunk, as well as --retain-disjoint, without
     a way to generate such packs.
  3. Implement a way to generate such packs, but without actually being
     able to reuse more than one of them.
  4. Implement multi-pack reuse, but without actually reusing any packs.
  5. Enable multi-pack reuse (and implement the required scaffolding to
     do so), and test it.

That's the most sensible split that I could come up with, at least. But
I still find it relatively unsatisfying for a couple of reasons:

  - With the exception of the last group of patches, none of the earlier
    series enable any new, useful behavior on their own. IOW, if we just
    merged the first three series and then forgot about this topic, we
    wouldn't have done anything useful ;-).

  - The fourth series (which actually implements multi-pack reuse) still
    remains the most complicated, and would likely be the most difficult
    to review. So I think you'd still have one difficult series to
    review, plus four other series which are slightly less difficult to
    review ;-).

It's entirely possible that I'm just too close to these patches to see a
better split, so if you (or others) have any suggestions on how to break
this up, please don't hesitate to share them.

> I'll review the remaining patches at a later point in time.

Thanks, I'll look forward to more of your review as usual :-).

Thanks,
Taylor

* Re: [PATCH 07/24] midx: implement `--retain-disjoint` mode
  2023-11-30 19:29     ` Taylor Blau
@ 2023-12-01  8:02       ` Patrick Steinhardt
  0 siblings, 0 replies; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-01  8:02 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Thu, Nov 30, 2023 at 02:29:52PM -0500, Taylor Blau wrote:
> On Thu, Nov 30, 2023 at 11:18:51AM +0100, Patrick Steinhardt wrote:
> > > diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> > > index d130e65b28..ac0c7b124b 100644
> > > --- a/Documentation/git-multi-pack-index.txt
> > > +++ b/Documentation/git-multi-pack-index.txt
> > > @@ -54,6 +54,14 @@ write::
> > >  		"disjoint". See the "`DISP` chunk and disjoint packs"
> > >  		section in linkgit:gitformat-pack[5] for more.
> > >
> > > +	--retain-disjoint::
> > > +		When writing a multi-pack index with a reachability
> > > +		bitmap, keep any packs marked as disjoint in the
> > > +		existing MIDX (if any) as such in the new MIDX. Existing
> > > +		disjoint packs which are removed (e.g., not listed via
> > > +		`--stdin-packs`) are ignored. This option works in
> > > +		addition to the '+' marker for `--stdin-packs`.
> >
> > I'm trying to understand when you're expected to pass this flag and when
> > you're expected not to pass it. An explanation of this could also go
> > into the documentation here so that the user can make a more informed
> > decision.
> 
> I think there are multiple reasons that you may or may not want to pass
> that flag. Certainly if you're not using disjoint packs (and instead
> only care about single-pack verbatim reuse over the MIDX's preferred
> packfile), then you don't need to pass it.
> 
> But if you are using disjoint packs, you may want to pass it if you are
> adding packs to the MIDX which are disjoint, _and_ you want to hold onto
> the existing set of disjoint packs.
> 
> But if you want to change the set of disjoint packs entirely, you would
> want to omit this flag (unless you knew a-priori that you were going to
> drop all of the currently marked disjoint packs from the new MIDX you
> are writing, e.g. with --stdin-packs).
> 
> If you think it would be useful, I could try and distill some of this
> down, but I think that there is likely too much detail here for it to be
> useful in user-facing documentation.

Yeah, this indeed feels too detailed to be added here. I was hoping for
a simple "Never do this if"-style rule that points out why it is unwise
under some circumstances, but seems like it's not as simple as that.

Well, so be it. Thanks!

Patrick

* Re: [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode
  2023-11-30 19:32     ` Taylor Blau
@ 2023-12-01  8:17       ` Patrick Steinhardt
  2023-12-01 19:58         ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-01  8:17 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Thu, Nov 30, 2023 at 02:32:24PM -0500, Taylor Blau wrote:
> On Thu, Nov 30, 2023 at 11:18:57AM +0100, Patrick Steinhardt wrote:
> > > Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
> > > moral equivalent of marking the set of disjoint packs as kept, and
> > > ignoring their contents, even if it would have otherwise been packed. In
> > > fact, this similarity extends down to the implementation, where each
> > > disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.
> > >
> > > With this in place, we can use the kept-pack cache from 20b031fede
> > > (packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
> > > which looks up objects first in a cache containing just the set of kept
> > > (in this case, disjoint) packs. Assuming that the set of disjoint packs
> > > is a relatively small portion of the entire repository (which should be
> > > a safe assumption to make), each object lookup will be very inexpensive.
> >
> > This caught me by surprise a bit. I'd have expected that in the end,
> > most of the packfiles in a repository would be disjoint. Using for
> > example geometric repacks, my expectation was that all of the packs that
> > get written via geometric repacking would eventually become disjoint
> > whereas new packs added to the repository would initially not be.
> 
> Which part are you referring to here? If you're referring to the part
> where I say that the set of disjoint packs is relatively small in
> proportion to the rest of the packs, I think I know where the confusion
> is.

Yeah, that's what I was referring to.

> I'm not saying that the set of disjoint packs is small in comparison to
> the rest of the repository by object count, but rather by count of packs
> overall. You're right that packs from pushes will not be guaranteed to
> be disjoint upon entering the repository, but will become disjoint when
> geometrically repacked (assuming that the caller uses --ignore-disjoint
> when repacking).

I was actually thinking about it in terms of the number of packfiles, not
the number of objects. I'm mostly coming from the angle of geometric
repacking here, where it is totally expected that you have a
comparatively large number of packfiles when your repository is big.
With a geometric factor of 2, you'll have up to `log2($numobjects)`
packfiles in your repo while keeping the geometric sequence intact.

In something like linux.git with almost 10M objects that boils down to
23 packfiles, and I'd assume that all of these would be disjoint in the
best case. So if you gain new packfiles by pushing into the repository
then I'd think that it's quite likely that the number of non-disjoint
packfiles is smaller than the number of disjoint ones.
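
The estimate of 23 can be checked with a short sketch. It assumes an
idealized progression in which the smallest pack holds one object and
each further pack holds exactly `factor` times as many as the one before
it; `max_geometric_packs()` is a hypothetical helper, not something Git
provides.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count how many packs fit into `nr_objects` objects when the smallest
 * pack holds one object and each later pack holds `factor` times as
 * many objects as the previous one (an idealized geometric sequence).
 */
unsigned max_geometric_packs(uint64_t nr_objects, uint64_t factor)
{
	uint64_t size = 1, total = 0;
	unsigned packs = 0;

	assert(factor >= 2);

	/* `total` never exceeds `nr_objects`, so the subtraction is safe. */
	while (size <= nr_objects - total) {
		total += size;
		packs++;
		/* Stop before `size * factor` could overflow. */
		if (size > UINT64_MAX / factor)
			break;
		size *= factor;
	}
	return packs;
}
```

For 10,000,000 objects and a factor of 2 this yields 23, since 23 packs
hold 2^23 - 1 = 8,388,607 objects and a 24th pack would push the total
past 16 million.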

I do realize though that in absolute numbers, this isn't all that many.
I was also thinking ahead though to a future where we have something
like geometric repacking with maximum packfile sizes working well
together so that we'll be able to merge packfiles together until they
reach a certain maximum size, and afterwards they are just left alone.
This would help to avoid those "surprise" repack cases where everything
is again packed into a single packfile for the biggest repositories out
there. But it would of course also lead to an increase in packfiles in
those huge repositories.

Anyway, I feel like I'm rambling. In the end it's probably going to be
fine, I was simply surprised by your assumption that the number of
disjoint packfiles should usually be much smaller than the number of
non-disjoint ones.

Patrick

* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
  2023-11-30 19:39   ` Taylor Blau
@ 2023-12-01  8:31     ` Patrick Steinhardt
  2023-12-01 20:02       ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-01  8:31 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Thu, Nov 30, 2023 at 02:39:41PM -0500, Taylor Blau wrote:
> On Thu, Nov 30, 2023 at 11:18:19AM +0100, Patrick Steinhardt wrote:
[snip]
> Then you'd have five patch series, where each series does roughly the
> following:
> 
>   1. Preparatory clean-up.
>   2. Implementing the DISP chunk, as well as --retain-disjoint, without
>      a way to generate such packs.
>   3. Implement a way to generate such packs, but without actually being
>      able to reuse more than one of them.
>   4. Implement multi-pack reuse, but without actually reusing any packs.
>   5. Enable multi-pack reuse (and implement the required scaffolding to
>      do so), and test it.
> 
> That's the most sensible split that I could come up with, at least.

Looks sensible to me.

> But
> I still find it relatively unsatisfying for a couple of reasons:
> 
>   - With the exception of the last group of patches, none of the earlier
>     series enable any new, useful behavior on their own. IOW, if we just
>     merged the first three series and then forgot about this topic, we
>     wouldn't have done anything useful ;-).

Well, sometimes I wish we'd buy more into the iterative style of working
in the Git project, where it's fine to land patch series that only work
toward a specific topic but don't yet do anything interesting by
themselves. The benefits would be both that individual pieces land
faster and that the review load stays manageable.

But there are of course also downsides to this, especially in an open
source project like Git:

  - Contributors may go away in the middle of their endeavour, leaving
    behind dangling pieces that might have complicated some of our
    architecture without actually reaping the intended benefits.

  - Overall, it may take significantly longer to get all pieces into
    Git.

  - Contributors need to do a much better job to explain where they are
    headed when the series by itself doesn't do anything interesting
    yet.

So I understand why we typically don't work this way.

I wonder whether a compromise would be to continue sending complete
patch series, but explicitly point out "break points" in your patch
series. These break points could be an indicator to the maintainer that
it's fine to merge everything up to it while still keeping out the
remainder of the patch series.

>   - The fourth series (which actually implements multi-pack reuse) still
>     remains the most complicated, and would likely be the most difficult
>     to review. So I think you'd still have one difficult series to
>     review, plus four other series which are slightly less difficult to
>     review ;-).

That's fine in my opinion, there's no surprise here that a complicated
topic is demanding for both the author and the reviewer.

Patrick

* Re: [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode
  2023-12-01  8:17       ` Patrick Steinhardt
@ 2023-12-01 19:58         ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-01 19:58 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Fri, Dec 01, 2023 at 09:17:38AM +0100, Patrick Steinhardt wrote:
> In something like linux.git with almost 10M objects that boils down to
> 23 packfiles, and I'd assume that all of these would be disjoint in the
> best case. So if you gain new packfiles by pushing into the repository
> then I'd think that it's quite likely that the number of non-disjoint
> packfiles is smaller than the number of disjoint ones.

Right, although if you have 10M objects over 23 packs with a geometric
repacking factor of two, the last pack should have just around a single
object in it. In other words, as soon as you receive a push, your
geometric progression will collapse into a single pack.

So having a repository with 10M objects split across 23 packs is a
relatively short-lived state. And in general we should only be in that
state every time a repository doubles (again, assuming a factor of two).

In that sense, I'd expect relatively few packs to be disjoint, and for
each of those packs to have a relatively large number of objects,
accounting for most of the non-recent parts of the repository.

Thanks,
Taylor

* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
  2023-12-01  8:31     ` Patrick Steinhardt
@ 2023-12-01 20:02       ` Taylor Blau
  2023-12-04  8:49         ` Patrick Steinhardt
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-12-01 20:02 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Fri, Dec 01, 2023 at 09:31:14AM +0100, Patrick Steinhardt wrote:
> > But
> > I still find it relatively unsatisfying for a couple of reasons:
> >
> >   - With the exception of the last group of patches, none of the earlier
> >     series enable any new, useful behavior on their own. IOW, if we just
> >     merged the first three series and then forgot about this topic, we
> >     wouldn't have done anything useful ;-).
>
> Well, sometimes I wish we'd buy more into the iterative style of working
> in the Git project, where it's fine to land patch series that only work
> toward a specific topic but don't yet do anything interesting by
> themselves. The benefits would be both that individual pieces land
> faster and that the review load stays manageable.
>
> But there are of course also downsides to this, especially in an open
> source project like Git:

I tend to agree with the downsides you list. My biggest concern with
this series in particular is that we're trying to break down an
all-or-nothing change into smaller components. So if we landed four out
of the five of those series, it would be better to have landed none of
them, since the first four aren't really all that useful on their own.

I suppose if we're relatively confident that the last series will be
merged eventually, then that seems like less of a concern. But I'm not
sure that we're at that point yet.

> I wonder whether a compromise would be to continue sending complete
> patch series, but explicitly point out "break points" in your patch
> series. These break points could be an indicator to the maintainer that
> it's fine to merge everything up to it while still keeping out the
> remainder of the patch series.

I think that's a reasonable alternative. It does create a minor headache
for the maintainer[^1], though, so I'd like to avoid it if possible.

> >   - The fourth series (which actually implements multi-pack reuse) still
> >     remains the most complicated, and would likely be the most difficult
> >     to review. So I think you'd still have one difficult series to
> >     review, plus four other series which are slightly less difficult to
> >     review ;-).
>
> That's fine in my opinion, there's no surprise here that a complicated
> topic is demanding for both the author and the reviewer.

My preference is to avoid splitting the series if we can help it. But if
you feel strongly, or others feel similarly, I'm happy to take another
crack at breaking it up. Thanks for all of your review so far!

Thanks,
Taylor

* Re: [PATCH 05/24] midx: implement `DISP` chunk
  2023-11-28 19:08 ` [PATCH 05/24] midx: implement `DISP` chunk Taylor Blau
  2023-11-30 10:18   ` Patrick Steinhardt
@ 2023-12-03 13:15   ` Junio C Hamano
  2023-12-05 19:26     ` Taylor Blau
  1 sibling, 1 reply; 107+ messages in thread
From: Junio C Hamano @ 2023-12-03 13:15 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index 3696506eb3..d130e65b28 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -49,6 +49,10 @@ write::
>  	--stdin-packs::
>  		Write a multi-pack index containing only the set of
>  		line-delimited pack index basenames provided over stdin.
> +		Lines beginning with a '+' character (followed by the
> +		pack index basename as before) have their pack marked as
> +		"disjoint". See the "`DISP` chunk and disjoint packs"
> +		section in linkgit:gitformat-pack[5] for more.

Makes one wonder who computes the set of packfiles, decides to
prefix '+' to which ones, and how it does so, none of which appear
in this step (which is understandable).  As the flow of information
is from the producer of individual "disjoint" packs (not in this
step) to this new logic in "--stdin-packs" to the new "DISP" chunk
writer (the primary focus of this step) to the final consumer of
"DISP" chunk (not in this step), we are digging from the middle
(hopefully to both directions in other steps).  It is probably the
easiest way to explain the idea to start from the primary data
structures and "DISP" seems to be a good place to start.

> +	    Two packs are "disjoint" with respect to one another when they have
> +	    disjoint sets of objects. In other words, any object found in a pack
> +	    contained in the set of disjoint packfiles is guaranteed to be
> +	    uniquely located among those packs.

I often advise people to rethink what they wrote _before_ "In other
words", because the use of that phrase is a sign that the author
considers the statement hard to grok and in need of rephrasing, in
which case the rephrased version may be a better way to explain the
concept being presented without the harder-to-grok version.

But I do not think this one is a good example to apply the advice.
It is because "In other words," is somewhat misused in the sentence.
Two "disjoint" packs do not store any common object (which is how
you defined the adjective "disjoint" in the first sentence).  "As a
consequence"/"Hence", an object found in one pack among many
"disjoint" packs will not appear in others.

By the way, how strict does this disjointness have to be?

Let's examine an extreme case.  When you have two packs that are
"mostly" disjoint, but have one single object in common, how would
that object interfere with the bulk streaming of existing packdata
out of these two packs?  Would we be able to, say, safely pretend
that the problematic single object lives only in one but not in the
other (in other words, can we safely "ignore" the presence of the
copy in the other pack)?  I think it would break down if that
ignored copy is used as a delta base of another object in the same
pack, and the base object for the delta is recorded as OFS_DELTA
(which most likely every delta is these days), because we no longer
can stream out such deltified object without re-pointing its base to
the other copy, which will in turn screw up the relative offset of
other objects in the same stream.

OK, so it seems they really need to be strictly disjoint in order to
participate in the reuse of the existing packdata.  
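
To illustrate the offset problem with a minimal sketch: an OFS_DELTA
object records its base as a distance back from its own position in the
pack, so the recorded value stays valid only if every byte between the
base and the delta is streamed out unchanged. The helper below is
hypothetical and uses made-up object sizes; it just computes that
distance in the output stream when some objects are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Distance in bytes from the start of object `delta` back to the start
 * of object `base` in an output stream. Objects are written in index
 * order, and object `i` is emitted only if `kept[i]` is non-zero.
 * `base` must precede `delta`, and both must be kept.
 */
uint64_t stream_distance(const uint64_t *sizes, const int *kept,
			 size_t base, size_t delta)
{
	uint64_t dist = 0;
	size_t i;

	assert(base < delta);
	assert(kept[base] && kept[delta]);

	for (i = base; i < delta; i++)
		if (kept[i])
			dist += sizes[i];
	return dist;
}
```

With sizes of 100, 40, and 60 bytes, the third object sits 140 bytes
after the first when everything is kept, but only 100 bytes after it
once the middle object is dropped, so an offset of 140 recorded in the
original pack would no longer point at the base.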

> +When a chunk of bytes are reused from an existing pack, any objects
> +contained therein do not need to be added to the packing list, saving
> +memory and CPU time. But a chunk from an existing packfile can only be
> +reused when the following conditions are met:
> +
> +  - The chunk contains only objects which were requested by the caller
> +    (i.e. does not contain any objects which the caller didn't ask for
> +    explicitly or implicitly).

OK.

> +  - All objects stored as offset- or reference-deltas also include their
> +    base object in the resulting pack.

Are thin packs obsolete?

> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index c4c6060cee..fd24e0c952 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -1157,4 +1157,62 @@ test_expect_success 'reader notices too-small revindex chunk' '
>  	test_cmp expect.err err
>  '
>  
> +test_expect_success 'disjoint packs are stored via the DISP chunk' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		for i in 1 2 3 4 5
> +		do
> +			test_commit "$i" &&
> +			git repack -d || return 1
> +		done &&
> +
> +		find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename | sort >packs &&

That is an overly-long line.

> +test_expect_success 'non-disjoint packs are detected' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		test_commit base &&
> +		git repack -d &&
> +		test_commit other &&
> +		git repack -a &&
> +
> +		ls -la .git/objects/pack/ &&

Is this line a leftover debugging aid?

> +		find $objdir/pack -type f -name "*.idx" |
> +			sed -e "s/.*\/\(.*\)$/+\1/g" >in &&

Lose "g"; it adds unnecessary cognitive burden to the readers if the
pattern is expected to match multiple times, and you know that is
not possible (your pattern is right anchored at the end).  This may
apply equally to other uses of "sed" in this patch.

> +		test_must_fail git multi-pack-index write --stdin-packs \
> +			--bitmap <in 2>err &&
> +		grep "duplicate object.* among disjoint packs" err
> +	)
> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
  2023-12-01 20:02       ` Taylor Blau
@ 2023-12-04  8:49         ` Patrick Steinhardt
  0 siblings, 0 replies; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-04  8:49 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Fri, Dec 01, 2023 at 03:02:22PM -0500, Taylor Blau wrote:
> On Fri, Dec 01, 2023 at 09:31:14AM +0100, Patrick Steinhardt wrote:
[snip]
> I suppose if we're relatively confident that the last series will be
> merged eventually, then that seems like less of a concern. But I'm not
> sure that we're at that point yet.

That's an additional valid concern indeed.

[snip]
> > >   - The fourth series (which actually implements multi-pack reuse) still
> > >     remains the most complicated, and would likely be the most difficult
> > >     to review. So I think you'd still have one difficult series to
> > >     review, plus four other series which are slightly less difficult to
> > >     review ;-).
> >
> > That's fine in my opinion, there's no surprise here that a complicated
> > topic is demanding for both the author and the reviewer.
> 
> My preference is to avoid splitting the series if we can help it. But if
> you feel strongly, or others feel similarly, I'm happy to take another
> crack at breaking it up. Thanks for all of your review so far!

I don't feel strongly about this at all, I've only tried to spell out my
own thoughts in this context as I thought they were kind of relevant
here. I've thought quite a lot about this topic recently due to my work
on the reftable backend, where I'm trying to get as many pieces as
possible landed individually before landing the actual backend itself.
It's working well for the most part, but in other contexts it's a bit
weird that we try to cater towards something that doesn't exist yet. But
naturally, the reftable work is of a different nature than the topic you
work on here, and thus my own takeaways may not apply here.

To summarize, I think there is merit in splitting up patches into chunks
that make it in individually and thus gradually work toward a topic, but
I also totally understand why you (or Junio as the maintainer) might
think that this is not a good idea. The ultimate decision for how to
handle topics should be with the patch series' author and maintainer.

Patrick

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 05/24] midx: implement `DISP` chunk
  2023-12-03 13:15   ` Junio C Hamano
@ 2023-12-05 19:26     ` Taylor Blau
  2023-12-09  1:40       ` Junio C Hamano
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-12-05 19:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Patrick Steinhardt

On Sun, Dec 03, 2023 at 10:15:11PM +0900, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> > index 3696506eb3..d130e65b28 100644
> > --- a/Documentation/git-multi-pack-index.txt
> > +++ b/Documentation/git-multi-pack-index.txt
> > @@ -49,6 +49,10 @@ write::
> >  	--stdin-packs::
> >  		Write a multi-pack index containing only the set of
> >  		line-delimited pack index basenames provided over stdin.
> > +		Lines beginning with a '+' character (followed by the
> > +		pack index basename as before) have their pack marked as
> > +		"disjoint". See the "`DISP` chunk and disjoint packs"
> > +		section in linkgit:gitformat-pack[5] for more.
>
> Makes one wonder who computes the set of packfiles, decides to
> prefix '+' to which ones, and how it does so, none of which appear
> in this step (which is understandable).  As the flow of information
> is from the producer of individual "disjoint" packs (not in this
> step) to this new logic in "--stdin-packs" to the new "DISP" chunk
> writer (the primary focus of this step) to the final consumer of
> "DISP" chunk (not in this step), we are digging from the middle
> (hopefully to both directions in other steps).  It is probably the
> easiest way to explain the idea to start from the primary data
> structures and "DISP" seems to be a good place to start.

Thanks. I found that laying out this series was rather tricky, since all
of the individual pieces really depend on the end state in order to make
any sense.

Hopefully you're satisfied with the way things are split up and
organized currently, but if you have suggestions on other ways I could
slice or dice this, please let me know.

> > +	    Two packs are "disjoint" with respect to one another when they have
> > +	    disjoint sets of objects. In other words, any object found in a pack
> > +	    contained in the set of disjoint packfiles is guaranteed to be
> > +	    uniquely located among those packs.
>
> I often advise people to rethink what they wrote _before_ "In other
> words", because the use of that phrase is a sign that the author
> considers the statement is hard to grok and needs rephrasing, in
> which case, the rephrased version may be a better way to explain the
> concept being presented without the harder-to-grok version.
>
> But I do not think this one is a good example to apply the advice.
> It is because "In other words," is somewhat misused in the sentence.
> Two "disjoint" packs do not store any common object (which is how
> you defined the adjective "disjoint" in the first sentence).  "As a
> consequence"/"Hence", an object found in one pack among many
> "disjoint" packs will not appear in others.

Thanks, I'll replace this with "As a consequence", and try to follow
that general advice more often in the future ;-).

> OK, so it seems they really need to be strictly disjoint in order to
> participate in the reuse of the existing packdata.

I think that's generally true, though there are some exceptions.

I think the real condition here is that the *reused sections* must be
disjoint with respect to one another, not necessarily the packs
themselves. So having the packs be disjoint is a sufficient condition,
since we know that no matter which section(s) we reuse, they are
guaranteed to be disjoint.

I think that there is opportunity to be more clever here, e.g., by
allowing for different disjoint "groups" of packs, or mandating that you
can only reuse certain sections from different combinations of packs in
order to satisfy this property.

That's part of the reason why I left more space than is needed for the
"disjoint" state in the DISP chunk (it is 32 bits, of which we're only
using one). I'm not sure that we would want more relaxed
constraints here, since they'd be harder to satisfy. But my hope is that
we would be able to learn from running this in production to figure out
whether or not such a thing would be useful.
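
To make the "disjoint" property discussed above concrete, here is a
minimal sketch of a pairwise check over two packs' sorted object lists.
It is not code from this series: `struct toy_oid` and
`toy_oid_lists_disjoint()` are hypothetical names, and it assumes each
list is already sorted bytewise, as a pack index would keep it.

```c
#include <stddef.h>
#include <string.h>

#define TOY_OID_LEN 20 /* SHA-1 sized; illustrative only */

/* Hypothetical stand-in for an object ID; not Git's struct object_id. */
struct toy_oid {
	unsigned char hash[TOY_OID_LEN];
};

/*
 * Returns 1 when the two sorted lists share no object, 0 otherwise.
 * Walks both lists in lockstep, so the cost is O(a_nr + b_nr).
 */
static int toy_oid_lists_disjoint(const struct toy_oid *a, size_t a_nr,
				  const struct toy_oid *b, size_t b_nr)
{
	size_t i = 0, j = 0;

	while (i < a_nr && j < b_nr) {
		int cmp = memcmp(a[i].hash, b[j].hash, TOY_OID_LEN);
		if (!cmp)
			return 0; /* same object found in both lists */
		if (cmp < 0)
			i++;
		else
			j++;
	}
	return 1;
}
```

A set of packs would be disjoint in the strict sense when this holds for
every pair. The weaker condition mentioned above (only the *reused
sections* being disjoint) would need the same check restricted to the
objects actually selected for reuse.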

> > +  - All objects stored as offset- or reference-deltas also include their
> > +    base object in the resulting pack.
>
> Are thin packs obsolete?

No, I think I should clarify this to make it more obvious that this only
applies to non-thin packs.

> > +		find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename | sort >packs &&
>
> That is an overly-long line.

Thanks for spotting.

> > +test_expect_success 'non-disjoint packs are detected' '
> > +	test_when_finished "rm -fr repo" &&
> > +	git init repo &&
> > +	(
> > +		cd repo &&
> > +
> > +		test_commit base &&
> > +		git repack -d &&
> > +		test_commit other &&
> > +		git repack -a &&
> > +
> > +		ls -la .git/objects/pack/ &&
>
> Is this line a leftover debugging aid?

Indeed, thanks.

> > +		find $objdir/pack -type f -name "*.idx" |
> > +			sed -e "s/.*\/\(.*\)$/+\1/g" >in &&
>
> Lose "g"; it adds unnecessary cognitive burden to the readers if the
> pattern is expected to match multiple times, and you know that is
> not possible (your pattern is right anchored at the end).  This may
> apply equally to other uses of "sed" in this patch.

Thanks, I dropped the 'g' in both instances.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode
  2023-11-28 19:08 ` [PATCH 09/24] repack: implement `--extend-disjoint` mode Taylor Blau
@ 2023-12-07 13:13   ` Patrick Steinhardt
  2023-12-07 20:28     ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-07 13:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:08:18PM -0500, Taylor Blau wrote:
> Now that we can generate packs which are disjoint with respect to the
> set of currently-disjoint packs, implement a mode of `git repack` which
> extends the set of disjoint packs with any new (non-cruft) pack(s)
> generated during the repack.
> 
> The idea is mostly straightforward, with a couple of gotchas. The
> straightforward part is to make sure that any new packs are disjoint
> with respect to the set of currently disjoint packs which are _not_
> being removed from the repository as a result of the repack.
> 
> If a pack which is currently marked as disjoint is, on the other hand,
> about to be removed from the repository, it is OK (and expected) that
> new pack(s) will contain some or all of its objects. Since the pack
> originally marked as disjoint will be removed, it will necessarily leave
> the disjoint set, making room for new packs with its same objects to
> take its place. In other words, the resulting set of disjoint packs will
> be disjoint with respect to one another.
> 
> The gotchas mostly have to do with making sure that we do not generate a
> disjoint pack in the following scenarios:

Okay, let me verify whether I understand the reasons:

>   - promisor packs

Which is because promisor packs actually don't contain any objects?

>   - cruft packs (which may necessarily need to include an object from a
>     disjoint pack in order to freshen it in certain circumstances)

This one took me a while to figure out. If we'd mark crufts as disjoint,
then it would mean that new packfiles cannot be marked as disjoint if
objects which were previously unreachable do become reachable again.
So we'd be pessimizing packfiles for live objects in favor of others
which aren't.

>   - all-into-one repacks without '-d'

Because here the old packfiles that this would make redundant aren't
deleted and thus the objects are duplicate now.

>   - `--filter-to`, which conceptually could work with the new
>     `--extend-disjoint` option, but only in limited circumstances

We're probably also not properly set up to handle the new alternate
object directory and exclude objects that are part of a potentially
disjoint packfile that exists already. Also, the current MIDX may not
even cover the alternate.

> Otherwise, we mark which packs were created as disjoint by using a new
> bit in the `generated_pack_data` struct, and then marking those pack(s)
> as disjoint accordingly when generating the MIDX. Non-deleted packs
> which are marked as disjoint are retained as such by passing the
> equivalent of `--retain-disjoint` when calling the MIDX API to update
> the MIDX.

Okay. I had a bit of trouble sifting through the various different
flags like "--retain-disjoint", "--extend-disjoint", "--ignore-disjoint"
and so on. But well, they do different things and it's been a few days
since I've reviewed the preceding patches, so this is probably fine.

One thing I wondered: do we need to consider the `-l` flag? When using
an alternate object directory it is totally feasible that the alternate
may be creating new disjoint packs without us knowing, and thus we
may not be able to guarantee the disjoint property anymore.

Patrick

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
  2023-11-28 19:08 ` [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
@ 2023-12-07 13:13   ` Patrick Steinhardt
  2023-12-07 20:34     ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-07 13:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:08:21PM -0500, Taylor Blau wrote:
[snip]
> @@ -2002,6 +1986,65 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
>  
>  done:
>  	unuse_pack(&w_curs);
> +}
> +
> +int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
> +				       struct packed_git **packfile_out,
> +				       uint32_t *entries,
> +				       struct bitmap **reuse_out)
> +{
> +	struct repository *r = the_repository;
> +	struct bitmapped_pack *packs = NULL;
> +	struct bitmap *result = bitmap_git->result;
> +	struct bitmap *reuse;
> +	size_t i;
> +	size_t packs_nr = 0, packs_alloc = 0;
> +	size_t word_alloc;
> +	uint32_t objects_nr = 0;
> +
> +	assert(result);
> +
> +	load_reverse_index(r, bitmap_git);
> +
> +	if (bitmap_is_midx(bitmap_git)) {
> +		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
> +			struct bitmapped_pack pack;
> +			if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) {
> +				warning(_("unable to load pack: '%s', disabling pack-reuse"),
> +					bitmap_git->midx->pack_names[i]);
> +				free(packs);
> +				return -1;
> +			}
> +			if (!pack.bitmap_nr)
> +				continue; /* no objects from this pack */
> +			if (pack.bitmap_pos)
> +				continue; /* not preferred pack */
> +
> +			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
> +			memcpy(&packs[packs_nr++], &pack, sizeof(pack));
> +
> +			objects_nr += pack.p->num_objects;
> +		}
> +	} else {
> +		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
> +
> +		packs[packs_nr].p = bitmap_git->pack;
> +		packs[packs_nr].bitmap_pos = 0;
> +		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
> +		packs[packs_nr].disjoint = 1;
> +
> +		objects_nr = packs[packs_nr++].p->num_objects;
> +	}
> +
> +	word_alloc = objects_nr / BITS_IN_EWORD;
> +	if (objects_nr % BITS_IN_EWORD)
> +		word_alloc++;
> +	reuse = bitmap_word_alloc(word_alloc);
> +
> +	if (packs_nr != 1)
> +		BUG("pack reuse not yet implemented for multiple packs");

Can't it happen that we have no pack here? In the MIDX-case we skip all
packs that either do not have a bitmap or are not preferred. So does it
mean that in reverse, every preferred packfile must have a bitmap? I'd
think that to not be true in case bitmaps are turned off.

Patrick

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
  2023-11-28 19:08 ` [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
@ 2023-12-07 13:13   ` Patrick Steinhardt
  2023-12-07 14:36     ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-07 13:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:08:24PM -0500, Taylor Blau wrote:
> The signature of `reuse_partial_packfile_from_bitmap()` currently takes
> in a bitmap, as well as three output parameters (filled through
> pointers, and passed as arguments), and also returns an integer result.
> 
> The output parameters are filled out with: (a) the packfile used for
> pack-reuse, (b) the number of objects from that pack that we can reuse,
> and (c) a bitmap indicating which objects we can reuse. The return value
> is either -1 (when there are no objects to reuse), or 0 (when there is
> at least one object to reuse).
> 
> Some of these parameters are redundant. Notably, we can infer from the
> bitmap how many objects are reused by calling bitmap_popcount(). And we
> can similarly compute the return value based on that number as well.
> 
> As such, clean up the signature of this function to drop the "*entries"
> parameter, as well as the int return value, since the single caller of
> this function can infer these values itself.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/pack-objects.c | 16 +++++++++-------
>  pack-bitmap.c          | 16 +++++++---------
>  pack-bitmap.h          |  7 +++----
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 107154db34..2bb1b64e8f 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3946,13 +3946,15 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
>  	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
>  		return -1;
>  
> -	if (pack_options_allow_reuse() &&
> -	    !reuse_partial_packfile_from_bitmap(
> -			bitmap_git,
> -			&reuse_packfile,
> -			&reuse_packfile_objects,
> -			&reuse_packfile_bitmap)) {
> -		assert(reuse_packfile_objects);
> +	if (pack_options_allow_reuse())
> +		reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
> +						   &reuse_packfile_bitmap);
> +
> +	if (reuse_packfile) {
> +		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
> +		if (!reuse_packfile_objects)
> +			BUG("expected non-empty reuse bitmap");

We're now re-computing `bitmap_popcount()` for the bitmap a second time.
But I really don't think this is ever going to be a problem in practice
given that it only does a bunch of math. Any performance regression
would thus ultimately be drowned out by everything else.

In other words: this is probably fine.

Patrick

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack
  2023-11-28 19:08 ` [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
@ 2023-12-07 13:13   ` Patrick Steinhardt
  2023-12-07 20:43     ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-07 13:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:08:32PM -0500, Taylor Blau wrote:
> When reusing objects from a pack, we keep track of a set of one or more
> `reused_chunk`s, corresponding to sections of one or more object(s) from
> a source pack that we are reusing. Each chunk contains two pieces of
> information:
> 
>   - the offset of the first object in the source pack (relative to the
>     beginning of the source pack)
>   - the difference between that offset, and the corresponding offset in
>     the pack we're generating
> 
> The purpose of keeping track of these is so that we can patch any
> OFS_DELTAs that cross over a section of the reuse pack that we didn't
> take.
> 
> For instance, consider a hypothetical pack as shown below:
> 
>                                                 (chunk #2)
>                                                 __________...
>                                                /
>                                               /
>       +--------+---------+-------------------+---------+
>   ... | <base> | <other> |      (unused)     | <delta> | ...
>       +--------+---------+-------------------+---------+
>        \                /
>         \______________/
>            (chunk #1)
> 
> Suppose that we are sending objects "base", "other", and "delta", and
> that the "delta" object is stored as an OFS_DELTA, and that its base is
> "base". If we don't send any objects in the "(unused)" range, we can't
> copy the delta'd object directly, since its delta offset includes a
> range of the pack that we didn't copy, so we have to account for that
> difference when patching and reassembling the delta.
> 
> In order to compute this value correctly, we need to know not only where
> we are in the packfile we're assembling (with `hashfile_total(f)`) but
> also the position of the first byte of the packfile that we are
> currently reusing.
> 
> Together, these two allow us to compute the reused chunk's offset
> difference relative to the start of the reused pack, as desired.

Hm. I'm not quite sure I fully understand the motivation here. Is this
something that was broken all along? Why does it become a problem now?
Sorry if I'm missing the obvious here.
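
For readers following along, the arithmetic described in the quoted
commit message can be sketched as below. This is an illustration under
stated assumptions, not the code from the patch: both helper names are
hypothetical, plain `int64_t` stands in for `off_t`, and the way the two
recorded differences are combined into a patched delta distance is an
assumption about how such differences are typically used.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the "difference" recorded for a reused object,
 * i.e. its offset in the source pack minus its position in the output
 * measured from where this pack's data began (out_total - pack_start).
 */
static int64_t toy_chunk_difference(int64_t src_offset, int64_t out_total,
				    int64_t pack_start)
{
	return src_offset - (out_total - pack_start);
}

/*
 * Hypothetical helper: the OFS_DELTA distance to write out, given the
 * source offsets of the delta and its base and the difference recorded
 * for each. Bytes skipped between base and delta shrink the distance.
 */
static int64_t toy_patched_delta_distance(int64_t delta_src, int64_t base_src,
					  int64_t delta_diff, int64_t base_diff)
{
	return (delta_src - base_src) - (delta_diff - base_diff);
}
```

As a worked case with made-up numbers: take "base" at source offset 12
and "delta" at source offset 412, with 300 unused bytes skipped between
them. If "base" is written when the output total is 1012 and "delta"
when it is 1112, with `pack_start` equal to 1000, the recorded
differences are 0 and 300, and the original distance of 400 becomes 100,
which matches 1112 - 1012 in the output.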

> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/pack-objects.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 7682bd65bb..eb8be514d1 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1016,6 +1016,7 @@ static off_t find_reused_offset(off_t where)
>  
>  static void write_reused_pack_one(struct packed_git *reuse_packfile,
>  				  size_t pos, struct hashfile *out,
> +				  off_t pack_start,
>  				  struct pack_window **w_curs)
>  {
>  	off_t offset, next, cur;
> @@ -1025,7 +1026,8 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
>  	offset = pack_pos_to_offset(reuse_packfile, pos);
>  	next = pack_pos_to_offset(reuse_packfile, pos + 1);
>  
> -	record_reused_object(offset, offset - hashfile_total(out));
> +	record_reused_object(offset,
> +			     offset - (hashfile_total(out) - pack_start));
>  
>  	cur = offset;
>  	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
> @@ -1095,6 +1097,7 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
>  
>  static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
>  					 struct hashfile *out,
> +					 off_t pack_start UNUSED,
>  					 struct pack_window **w_curs)
>  {
>  	size_t pos = 0;
> @@ -1126,10 +1129,12 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
>  {
>  	size_t i = 0;
>  	uint32_t offset;
> +	off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);

Given that this patch in its current state doesn't seem to do anything
yet, am I right in assuming that `hashfile_total(f) - sizeof(struct
pack_header)` is always expected to be zero for now?

Patrick

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse
  2023-11-28 19:08 ` [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
@ 2023-12-07 13:13   ` Patrick Steinhardt
  2023-12-07 20:47     ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-07 13:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Tue, Nov 28, 2023 at 02:08:37PM -0500, Taylor Blau wrote:
> The function `write_reused_pack()` within `builtin/pack-objects.c` is
> responsible for performing pack-reuse on a single pack, and has two main
> functions:
> 
>   - it dispatches a call to `write_reused_pack_verbatim()` to see if we
>     can reuse portions of the packfile in whole-word chunks
> 
>   - for any remaining objects (that is, any objects that appear after
>     the first "gap" in the bitmap), call write_reused_pack_one() on that
>     object to record it for reuse.
> 
> Prepare this function for multi-pack reuse by removing the assumption
> that the bit position corresponding to the first object being reused
> from a given pack may not be at bit position zero.

Is this double-negation intended? We remove the assumption that we start
reading at position zero, but the paragraph here states that we remove
the assumption that we do _not_ start at bit zero.

I'll stop reviewing here and will come back to this series sometime next
week.

Patrick

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
  2023-12-07 13:13   ` Patrick Steinhardt
@ 2023-12-07 14:36     ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-07 14:36 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Dec 07, 2023 at 02:13:19PM +0100, Patrick Steinhardt wrote:
> > +	if (reuse_packfile) {
> > +		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
> > +		if (!reuse_packfile_objects)
> > +			BUG("expected non-empty reuse bitmap");
>
> We're now re-computing `bitmap_popcount()` for the bitmap a second time.
> But I really don't think this is ever going to be a problem in practice
> given that it only does a bunch of math. Any performance regression
> would thus ultimately be drowned out by everything else.
>
> In other words: this is probably fine.

I definitely agree that any performance regression from calling
bitmap_popcount() twice would be drowned out by the rest of what
pack-objects is doing.

For what it's worth:

- The bitmap_popcount() call is a loop over ewah_bit_popcount64() for
  each of the allocated words. And the latter is more or less three
  copies of:

      b7:	55 55 55
      ba:	48 23 45 f8          	and    -0x8(%rbp),%rax
      be:	48 8b 55 f8          	mov    -0x8(%rbp),%rdx
      c2:	48 89 d1             	mov    %rdx,%rcx
      c5:	48 d1 e9             	shr    %rcx
      c8:	48 ba 55 55 55 55 55 	movabs $0x5555555555555555,%rdx
      cf:	55 55 55
      d2:	48 21 ca             	and    %rcx,%rdx
      d5:	48 01 d0             	add    %rdx,%rax
      d8:	48 89 45 f8          	mov    %rax,-0x8(%rbp)
      dc:	48 b8 33 33 33 33 33 	movabs $0x3333333333333333,%rax

  Followed by:

     144:	48 0f af c2          	imul   %rdx,%rax
     148:	48 c1 e8 38          	shr    $0x38,%rax
     14c:	5d                   	pop    %rbp
     14d:	c3                   	ret

  With the usual x86 ABI preamble and postamble. So this should be an
  extremely cheap function to compute.

- But, the earlier bitmap_popcount() call in
  reuse_partial_packfile_from_bitmap() is not necessary, since we only
  care whether or not there are _any_ bits set in the bitmap, not how
  many of them there are.

  So we could write something like `bitmap_empty(reuse)` instead, which
  would be much cheaper (again, not that I think we'll notice this
  either way, but throwing away the result of bitmap_popcount() and
  calling it twice does leave me a little unsatisfied).

So I think we could reasonably do something like:

--- 8< ---
diff --git a/ewah/bitmap.c b/ewah/bitmap.c
index 7b525b1ecd..ac7e0af622 100644
--- a/ewah/bitmap.c
+++ b/ewah/bitmap.c
@@ -169,6 +169,15 @@ size_t bitmap_popcount(struct bitmap *self)
 	return count;
 }

+int bitmap_is_empty(struct bitmap *self)
+{
+	size_t i;
+	for (i = 0; i < self->word_alloc; i++)
+		if (self->words[i])
+			return 0;
+	return 1;
+}
+
 int bitmap_equals(struct bitmap *self, struct bitmap *other)
 {
 	struct bitmap *big, *small;
diff --git a/ewah/ewok.h b/ewah/ewok.h
index 7eb8b9b630..c11d76c6f3 100644
--- a/ewah/ewok.h
+++ b/ewah/ewok.h
@@ -189,5 +189,6 @@ void bitmap_or_ewah(struct bitmap *self, struct ewah_bitmap *other);
 void bitmap_or(struct bitmap *self, const struct bitmap *other);

 size_t bitmap_popcount(struct bitmap *self);
+int bitmap_is_empty(struct bitmap *self);

 #endif
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 614fc09a4e..e50b322779 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2045,7 +2045,7 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,

 	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);

-	if (!bitmap_popcount(reuse)) {
+	if (bitmap_is_empty(reuse)) {
 		free(packs);
 		bitmap_free(reuse);
 		return;
--- >8 ---
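
For comparison, the two approaches can be sketched in standalone C. This
is a simplified illustration, not the ewah code itself: `struct
toy_bitmap` and the `toy_*` functions are hypothetical stand-ins, and
the popcount body is the well-known mask-and-multiply technique that the
constants in the disassembly above (0x5555..., 0x3333..., the multiply
and the shift by 0x38) suggest.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an uncompressed bitmap. */
struct toy_bitmap {
	uint64_t *words;
	size_t word_alloc;
};

/* Count set bits by summing adjacent 1-, 2-, then 4-bit fields. */
static uint32_t toy_popcount64(uint64_t x)
{
	x = (x & 0x5555555555555555ULL) + ((x >> 1) & 0x5555555555555555ULL);
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x & 0x0f0f0f0f0f0f0f0fULL) + ((x >> 4) & 0x0f0f0f0f0f0f0f0fULL);
	/* The multiply adds all eight byte counts into the top byte. */
	return (uint32_t)((x * 0x0101010101010101ULL) >> 56);
}

/* Always visits every word. */
static size_t toy_bitmap_popcount(const struct toy_bitmap *b)
{
	size_t i, count = 0;

	for (i = 0; i < b->word_alloc; i++)
		count += toy_popcount64(b->words[i]);
	return count;
}

/* Stops at the first non-zero word, with no per-word arithmetic. */
static int toy_bitmap_is_empty(const struct toy_bitmap *b)
{
	size_t i;

	for (i = 0; i < b->word_alloc; i++)
		if (b->words[i])
			return 0;
	return 1;
}
```

The emptiness check can return as soon as it sees one set bit, whereas
the popcount always walks the whole allocation, which is the asymmetry
the suggestion above relies on.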

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode
  2023-12-07 13:13   ` Patrick Steinhardt
@ 2023-12-07 20:28     ` Taylor Blau
  2023-12-08  8:19       ` Patrick Steinhardt
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-12-07 20:28 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Dec 07, 2023 at 02:13:08PM +0100, Patrick Steinhardt wrote:
> > The gotchas mostly have to do with making sure that we do not generate a
> > disjoint pack in the following scenarios:
>
> Okay, let me verify whether I understand the reasons:
>
> >   - promisor packs
>
> Which is because promisor packs actually don't contain any objects?

Right.

> >   - cruft packs (which may necessarily need to include an object from a
> >     disjoint pack in order to freshen it in certain circumstances)
>
> This one took me a while to figure out. If we'd mark crufts as disjoint,
> then it would mean that new packfiles cannot be marked as disjoint if
> objects which were previously unreachable do become reachable again.
> So we'd be pessimizing packfiles for live objects in favor of others
> which aren't.

Yeah, that's right, too. There are a couple of cases where more than one
cruft pack may contain the same object, one of them being the
flip-flopping between reachable and unreachable as you suggest above.
Another is that you have a non-prunable unreachable object which is
already in a cruft pack. If the object's mtime gets updated (and still
cannot be pruned), we'll end up freshening the object loose, and then
packing it again (with the more recent mtime) into a new cruft pack.

That aside, I actually think that there are ways to mark cruft packs
disjoint. But they're complicated, and moreover, I don't think you'd
ever *want* to mark a cruft pack as disjoint. Cruft packs usually
contain garbage, which is unlikely to be useful to any fetches/clones.

If we did mark them as disjoint, it would mean that we could reuse
verbatim sections of the cruft pack in our output, but we would likely
end up with very few such sections.

> >   - all-into-one repacks without '-d'
>
> Because here the old packfiles that this would make redundant aren't
> deleted and thus the objects are duplicate now.

Yep.

> > Otherwise, we mark which packs were created as disjoint by using a new
> > bit in the `generated_pack_data` struct, and then marking those pack(s)
> > as disjoint accordingly when generating the MIDX. Non-deleted packs
> > which are marked as disjoint are retained as such by passing the
> > equivalent of `--retain-disjoint` when calling the MIDX API to update
> > the MIDX.
>
> Okay. I had a bit of trouble sifting through the various different
> flags like "--retain-disjoint", "--extend-disjoint", "--ignore-disjoint"
> and so on. But well, they do different things and it's been a few days
> since I've reviewed the preceding patches, so this is probably fine.

Yeah, I am definitely open to better naming conventions here. I figured
that:

  - --retain-disjoint was a good name for the MIDX option, since it is
    retaining existing disjoint packs in the new MIDX
  - --extend-disjoint was a good name for the repack option, since it is
    extending the set of disjoint packs
  - --ignore-disjoint was a good name for the pack-objects option, since
    it is ignoring objects in disjoint packs

Writing this out, I think that you could make an argument that
`--exclude-disjoint` is a better name for the last option. So I'm
definitely open to suggestions here, but I don't want to get too bogged
down on command-line option naming (so long as we're all reasonably
happy with the result).

> One thing I wondered: do we need to consider the `-l` flag? When using
> an alternate object directory it is totally feasible that the alternate
> may be creating new disjoint packages without us knowing, and thus we
> may not be able to guarantee the disjoint property anymore.

I don't think so. We'd only care about one direction of this (that
alternates do not create disjoint packs which overlap with ours, instead
of the other way around), but since we don't put non-local packs in the
MIDX, I think we're OK.

I suppose that you might run into trouble if you use the chained MIDX
thing (via its `->next` pointer). I haven't used that feature myself, so
I'd have to play around with it.

Thanks,
Taylor

* Re: [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
  2023-12-07 13:13   ` Patrick Steinhardt
@ 2023-12-07 20:34     ` Taylor Blau
  2023-12-08  8:19       ` Patrick Steinhardt
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-12-07 20:34 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Dec 07, 2023 at 02:13:13PM +0100, Patrick Steinhardt wrote:
> > +	if (bitmap_is_midx(bitmap_git)) {
> > +		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
> > +			struct bitmapped_pack pack;
> > +			if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) {
> > +				warning(_("unable to load pack: '%s', disabling pack-reuse"),
> > +					bitmap_git->midx->pack_names[i]);
> > +				free(packs);
> > +				return -1;
> > +			}
> > +			if (!pack.bitmap_nr)
> > +				continue; /* no objects from this pack */
> > +			if (pack.bitmap_pos)
> > +				continue; /* not preferred pack */
> > +
> > +			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
> > +			memcpy(&packs[packs_nr++], &pack, sizeof(pack));
> > +
> > +			objects_nr += pack.p->num_objects;
> > +		}
> > +	} else {
> > +		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
> > +
> > +		packs[packs_nr].p = bitmap_git->pack;
> > +		packs[packs_nr].bitmap_pos = 0;
> > +		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
> > +		packs[packs_nr].disjoint = 1;
> > +
> > +		objects_nr = packs[packs_nr++].p->num_objects;
> > +	}
> > +
> > +	word_alloc = objects_nr / BITS_IN_EWORD;
> > +	if (objects_nr % BITS_IN_EWORD)
> > +		word_alloc++;
> > +	reuse = bitmap_word_alloc(word_alloc);
> > +
> > +	if (packs_nr != 1)
> > +		BUG("pack reuse not yet implemented for multiple packs");
>
> Can't it happen that we have no pack here? In the MIDX-case we skip all
> packs that either do not have a bitmap or are not preferred. So does it
> mean that in reverse, every preferred packfile must have a bitmap? I'd
> think that to not be true in case bitmaps are turned off.

It's subtle, but this state is indeed not possible. If we have a MIDX
and it has a bitmap, we know that there is at least one object in at
least one pack.

On the "at least one object" front, that check was added in eb57277ba3
(midx: prevent writing a .bitmap without any objects, 2022-02-09). And
we know that our preferred pack (either explicitly given or the one we
infer automatically) is non-empty, via the check added in 5d3cd09a80
(midx: reject empty `--preferred-pack`'s, 2021-08-31).

(As a fun/non-fun aside, looking these up gave me some serious deja-vu
and reminded me of how painful discovering and fixing those bugs was!)

So we're OK here. We could add a comment which captures what I wrote
above here, but since this is a temporary state (and we're going to
change how we select which packs are reuse candidates in a later patch),
I think it's OK to avoid (but please let me know if you feel differently).
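
To make the invariant concrete, here is a rough, self-contained sketch of
the two filters from the hunk above. It is not code from this series:
`struct pack_info` and `count_reuse_candidates()` are hypothetical names,
and the sketch assumes (as the quoted hunk does) that the preferred pack
is the one whose objects start at bit zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the series' `struct bitmapped_pack`. */
struct pack_info {
	uint32_t bitmap_pos; /* first bit owned by this pack in the bitmap */
	uint32_t bitmap_nr;  /* number of bits (objects) owned by this pack */
};

/*
 * Count the packs that survive both filters: skip packs that contribute
 * no objects to the bitmap, and skip packs that do not start at bit zero
 * (that is, packs other than the preferred one).
 */
static size_t count_reuse_candidates(const struct pack_info *packs, size_t nr)
{
	size_t i, found = 0;

	assert(packs || !nr);
	for (i = 0; i < nr; i++) {
		if (!packs[i].bitmap_nr)
			continue; /* no objects from this pack */
		if (packs[i].bitmap_pos)
			continue; /* not preferred pack */
		found++;
	}
	return found;
}
```

With a non-empty preferred pack at bit zero, exactly one pack survives,
which is the `packs_nr == 1` state the BUG() guards.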

Thanks,
Taylor

* Re: [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack
  2023-12-07 13:13   ` Patrick Steinhardt
@ 2023-12-07 20:43     ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-07 20:43 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Dec 07, 2023 at 02:13:24PM +0100, Patrick Steinhardt wrote:
> > In order to compute this value correctly, we need to know not only where
> > we are in the packfile we're assembling (with `hashfile_total(f)`) but
> > also the position of the first byte of the packfile that we are
> > currently reusing.
> >
> > Together, these two allow us to compute the reused chunk's offset
> > difference relative to the start of the reused pack, as desired.
>
> Hm. I'm not quite sure I fully understand the motivation here. Is this
> something that was broken all along? Why does it become a problem now?
> Sorry if I'm missing the obvious here.

No worries, I should have explained this better. Indeed we do have to
worry about patching deltas today when reusing objects from a pack. But
we have to extend the implementation in order to perform reuse over
multiple packs when any of them (excluding the first, which would work
with the existing logic) have delta/base pairs on either side of a gap.

I'll try to make it a little clearer, thanks for pointing that out.
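
Roughly, and only as a sketch under my own naming (the helper names and
the `PACK_HEADER_SIZE` constant below are made up for illustration, they
are not what the patch uses), the arithmetic looks like this:

```c
#include <assert.h>
#include <stdint.h>

/* Size of a pack header: signature, version, and object count (4 bytes each). */
#define PACK_HEADER_SIZE 12

/*
 * Where the pack being reused "starts" in the output, given how many
 * bytes have been written so far (including the output's own header).
 */
static int64_t reuse_pack_start(int64_t written_so_far)
{
	assert(written_so_far >= PACK_HEADER_SIZE);
	return written_so_far - PACK_HEADER_SIZE;
}

/*
 * How far an object moved: its offset in the source pack minus its
 * position in the output, measured relative to pack_start.
 */
static int64_t reuse_offset_shift(int64_t src_offset, int64_t out_pos,
				  int64_t pack_start)
{
	return src_offset - (out_pos - pack_start);
}
```

For the first pack only the header has been written, so the start is
zero and the shift reduces to the existing single-pack computation.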

> > @@ -1126,10 +1129,12 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
> >  {
> >  	size_t i = 0;
> >  	uint32_t offset;
> > +	off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);
>
> Given that this patch in its current state doesn't seem to do anything
> yet, am I right in assuming that `hashfile_total(f) - sizeof(struct
> pack_header)` is always expected to be zero for now?

Yep, that's right.

Thanks,
Taylor

* Re: [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse
  2023-12-07 13:13   ` Patrick Steinhardt
@ 2023-12-07 20:47     ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-07 20:47 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Thu, Dec 07, 2023 at 02:13:29PM +0100, Patrick Steinhardt wrote:
> On Tue, Nov 28, 2023 at 02:08:37PM -0500, Taylor Blau wrote:
> > The function `write_reused_pack()` within `builtin/pack-objects.c` is
> > responsible for performing pack-reuse on a single pack, and has two main
> > functions:
> >
> >   - it dispatches a call to `write_reused_pack_verbatim()` to see if we
> >     can reuse portions of the packfile in whole-word chunks
> >
> >   - for any remaining objects (that is, any objects that appear after
> >     the first "gap" in the bitmap), call write_reused_pack_one() on that
> >     object to record it for reuse.
> >
> > Prepare this function for multi-pack reuse by removing the assumption
> > that the bit position corresponding to the first object being reused
> > from a given pack may not be at bit position zero.
>
> Is this double-negation intended? We remove the assumption that we start
> reading at position zero, but the paragraph here states that we remove
> the assumption that we do _not_ start at bit zero.

Oops, great catch. I'll s/may not/must in the last paragraph to clarify.
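
For what it's worth, the thing the function can no longer assume is
easiest to see as arithmetic. This is a hypothetical sketch, not code
from the patch: `BITS_IN_WORD` stands in for the bitmap word size (64
here by assumption) and both helper names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_IN_WORD 64 /* assumed bitmap word size */

/* Index of the bitmap word that holds a pack's first bit. */
static size_t reuse_first_word(uint32_t bitmap_pos)
{
	return bitmap_pos / BITS_IN_WORD;
}

/* Position of that first bit inside its word; nonzero when unaligned. */
static unsigned reuse_first_bit_in_word(uint32_t bitmap_pos)
{
	return bitmap_pos % BITS_IN_WORD;
}
```

A pack whose objects begin at bit 70 starts in word 1 at bit 6, so
neither the word index nor the in-word offset can be hard-coded to zero.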

> I'll stop reviewing here and will come back to this series somewhen next
> week.

Thanks as usual for your review -- I appreciate you digging through this
rather dense series.

Thanks,
Taylor

* Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode
  2023-12-07 20:28     ` Taylor Blau
@ 2023-12-08  8:19       ` Patrick Steinhardt
  2023-12-08 22:48         ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-08  8:19 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Thu, Dec 07, 2023 at 03:28:18PM -0500, Taylor Blau wrote:
> On Thu, Dec 07, 2023 at 02:13:08PM +0100, Patrick Steinhardt wrote:
> > >   - cruft packs (which may necessarily need to include an object from a
> > >     disjoint pack in order to freshen it in certain circumstances)
> >
> > This one took me a while to figure out. If we'd mark crufts as disjoint,
> > then it would mean that new packfiles cannot be marked as disjoint if
> > objects which were previously unreachable do become reachable again.
> > So we'd be pessimizing packfiles for live objects in favor of others
> > which aren't.
> 
> Yeah, that's right, too. There are a couple of cases where more than one
> cruft pack may contain the same object, one of them being the
> flip-flopping between reachable and unreachable as you suggest above.
> Another is that you have a non-prunable unreachable object which is
> already in a cruft pack. If the object's mtime gets updated (and still
> cannot be pruned), we'll end up freshening the object loose, and then
> packing it again (with the more recent mtime) into a new cruft pack.
> 
> That aside, I actually think that there are ways to mark cruft packs
> disjoint. But they're complicated, and moreover, I don't think you'd
> ever *want* to mark a cruft pack as disjoint. Cruft packs usually
> contain garbage, which is unlikely to be useful to any fetches/clones.
> 
> If we did mark them as disjoint, it would mean that we could reuse
> verbatim sections of the cruft pack in our output, but we would likely
> end up with very few such sections.

Makes sense. It also doesn't feel worth it to introduce additional
complexity for objects that for the most part wouldn't ever be served
on a fetch anyway.

[snip]
> > Okay. I had a bit of trouble to sift through the various different
> > flags like "--retain-disjoint", "--extend-disjoint", "--ignore-disjoint"
> > and so on. But well, they do different things and it's been a few days
> > since I've reviewed the preceding patches, so this is probably fine.
> 
> Yeah, I am definitely open to better naming conventions here. I figured
> that:
> 
>   - --retain-disjoint was a good name for the MIDX option, since it is
>     retaining existing disjoint packs in the new MIDX
>   - --extend-disjoint was a good name for the repack option, since it is
>     extending the set of disjoint packs
>   - --ignore-disjoint was a good name for the pack-objects option, since
>     it is ignoring objects in disjoint packs
> 
> Writing this out, I think that you could make an argument that
> `--exclude-disjoint` is a better name for the last option. So I'm
> definitely open to suggestions here, but I don't want to get too bogged
> down on command-line option naming (so long as we're all reasonably
> happy with the result).

Yeah, as said, I don't mind it too much. It's a complex area and the
flags all do different things, so it's expected that you may have to do
some research on what exactly they do. That being said, I do like your
proposed `--exclude-disjoint` a lot more than `--ignore-disjoint`.

> > One thing I wondered: do we need to consider the `-l` flag? When using
> > an alternate object directory it is totally feasible that the alternate
> > may be creating new disjoint packages without us knowing, and thus we
> > may not be able to guarantee the disjoint property anymore.
> 
> I don't think so. We'd only care about one direction of this (that
> alternates do not create disjoint packs which overlap with ours, instead
> of the other way around), but since we don't put non-local packs in the
> MIDX, I think we're OK.
> 
> I suppose that you might run into trouble if you use the chained MIDX
> thing (via its `->next` pointer). I haven't used that feature myself, so
> I'd have to play around with it.

We do use this feature at GitLab for forks, where forks connect to a
common alternate object directory to deduplicate objects. As both the
fork repository and the alternate object directory use an MIDX I think
they would be set up exactly like that.

I guess the only really viable solution here is to ignore disjoint packs
in the main repo that connects to the alternate in the case where the
alternate has any disjoint packs itself.

Patrick

* Re: [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
  2023-12-07 20:34     ` Taylor Blau
@ 2023-12-08  8:19       ` Patrick Steinhardt
  0 siblings, 0 replies; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-08  8:19 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Thu, Dec 07, 2023 at 03:34:52PM -0500, Taylor Blau wrote:
> On Thu, Dec 07, 2023 at 02:13:13PM +0100, Patrick Steinhardt wrote:
[snip]
> > Can't it happen that we have no pack here? In the MIDX-case we skip all
> > packs that either do not have a bitmap or are not preferred. So does it
> > mean that in reverse, every preferred packfile must have a bitmap? I'd
> > think that to not be true in case bitmaps are turned off.
> 
> It's subtle, but this state is indeed not possible. If we have a MIDX
> and it has a bitmap, we know that there is at least one object in at
> least one pack.
> 
> On the "at least one object" front, that check was added in eb57277ba3
> (midx: prevent writing a .bitmap without any objects, 2022-02-09). And
> we know that our preferred pack (either explicitly given or the one we
> infer automatically) is non-empty, via the check added in 5d3cd09a80
> (midx: reject empty `--preferred-pack`'s, 2021-08-31).
> 
> (As a fun/non-fun aside, looking these up gave me some serious deja-vu
> and reminded me of how painful discovering and fixing those bugs was!)
> 
> So we're OK here. We could add a comment which captures what I wrote
> above here, but since this is a temporary state (and we're going to
> change how we select which packs are reuse candidates in a later patch),
> I think it's OK to avoid (but please let me know if you feel differently).

Makes sense, thanks for the explanation!

Patrick

* Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode
  2023-12-08  8:19       ` Patrick Steinhardt
@ 2023-12-08 22:48         ` Taylor Blau
  2023-12-11  8:18           ` Patrick Steinhardt
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-12-08 22:48 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Fri, Dec 08, 2023 at 09:19:25AM +0100, Patrick Steinhardt wrote:
> > Writing this out, I think that you could make an argument that
> > `--exclude-disjoint` is a better name for the last option. So I'm
> > definitely open to suggestions here, but I don't want to get too bogged
> > down on command-line option naming (so long as we're all reasonably
> > happy with the result).
>
> Yeah, as said, I don't mind it too much. It's a complex area and the
> flags all do different things, so it's expected that you may have to do
> some research on what exactly they do. That being said, I do like your
> proposed `--exclude-disjoint` a lot more than `--ignore-disjoint`.

I think that's fair, I renamed the option to be "--exclude-disjoint"
instead of "--ignore-disjoint" for any subsequent round(s) of this
series.

> > > One thing I wondered: do we need to consider the `-l` flag? When using
> > > an alternate object directory it is totally feasible that the alternate
> > > may be creating new disjoint packages without us knowing, and thus we
> > > may not be able to guarantee the disjoint property anymore.
> >
> > I don't think so. We'd only care about one direction of this (that
> > alternates do not create disjoint packs which overlap with ours, instead
> > of the other way around), but since we don't put non-local packs in the
> > MIDX, I think we're OK.
> >
> > I suppose that you might run into trouble if you use the chained MIDX
> > thing (via its `->next` pointer). I haven't used that feature myself, so
> > I'd have to play around with it.
>
> We do use this feature at GitLab for forks, where forks connect to a
> common alternate object directory to deduplicate objects. As both the
> fork repository and the alternate object directory use an MIDX I think
> they would be set up exactly like that.

Yep, that's right. I wasn't sure whether or not this feature had been
used extensively in production or not (we don't use it at GitHub, since
objects only live in their fork repositories for a short while before
moving to the fork network repository).

> I guess the only really viable solution here is to ignore disjoint packs
> in the main repo that connects to the alternate in the case where the
> alternate has any disjoint packs itself.

I think the behavior you'd get here is that we'd only look for disjoint
packs in the first MIDX in the chain (which is always the local one),
and we'd only recognize packs from that MIDX as being potentially
disjoint.

If you have the bulk of your repositories in the alternate, then I think
you might want to consider how we combine the two. My sense is that
you'd want to be disjoint with respect to anything downstream of you.

Whether or not this is a feature that you/others need, I definitely
think we should leave it out of this series, since I am (a) fairly
certain that this is possible to do, and (b) already feel like this
series on its own is complicated enough.

Thanks,
Taylor

* Re: [PATCH 05/24] midx: implement `DISP` chunk
  2023-12-05 19:26     ` Taylor Blau
@ 2023-12-09  1:40       ` Junio C Hamano
  2023-12-09  2:30         ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Junio C Hamano @ 2023-12-09  1:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

> Hopefully you're satisfied with the way things are split up and
> organized currently, but if you have suggestions on other ways I could
> slice or dice this, please let me know.

I did wonder how expensive it is to recompute and validate the "distinct"
information (in other words, is it too expensive for the consumers
of an existing midx file to see which packs are distinct on demand
before they stream contents out of the underlying packs?), as the
way the packs are marked as distinct looked rather error prone (you
can very easily list packfiles with overlaps with "+" prefix and the
DISP chunk writer does not even notice that you lied to it).  As long
as "git fsck" catches when two packs that are marked as distinct share
an object, that is OK, but the arrangement did look rather brittle
to me.

* Re: [PATCH 05/24] midx: implement `DISP` chunk
  2023-12-09  1:40       ` Junio C Hamano
@ 2023-12-09  2:30         ` Taylor Blau
  2023-12-12  8:03           ` Jeff King
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-12-09  2:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Patrick Steinhardt

On Fri, Dec 08, 2023 at 05:40:29PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Hopefully you're satisfied with the way things are split up and
> > organized currently, but if you have suggestions on other ways I could
> > slice or dice this, please let me know.
>
> I did wonder how expensive it is to recompute and validate the "distinct"
> information (in other words, is it too expensive for the consumers
> of an existing midx file to see which packs are distinct on demand
> before they stream contents out of the underlying packs?), as the
> way the packs are marked as distinct looked rather error prone (you
> can very easily list packfiles with overlaps with "+" prefix and the
> DISP chunk writer does not even notice that you lied to it).  As long
> as "git fsck" catches when two packs that are marked as distinct share
> an object, that is OK, but the arrangement did look rather brittle
> to me.

It's likely too expensive to do on the reading side for every
pack-objects operation or MIDX load. But we do check this property when
we write the MIDX, see these lines from midx.c::get_sorted_entries():

    for (cur_object = 0; cur_object < fanout.nr; cur_object++) {
      struct pack_midx_entry *ours = &fanout.entries[cur_object];
      if (cur_object) {
        struct pack_midx_entry *prev = &fanout.entries[cur_object - 1];
        if (oideq(&prev->oid, &ours->oid)) {
          if (prev->disjoint && ours->disjoint)
            die(_("duplicate object '%s' among disjoint packs '%s', '%s'"),
                oid_to_hex(&prev->oid),
                info[prev->pack_int_id].pack_name,
                info[ours->pack_int_id].pack_name);
          continue;
        }
      }

This series doesn't yet have a corresponding step in the fsck builtin,
but I will investigate adding one.

I'm happy to include it in a subsequent round here, but I worry that
this series is already on the verge of being too complex as-is, so it
may be nice as a follow-up, too.
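
If it helps to see the idea outside of midx.c, here is a standalone
sketch of the same kind of check. It is an illustration only: `struct
entry` and `find_disjoint_conflict()` are hypothetical, object IDs are
plain integers, and unlike the excerpt above it tracks a whole run of
equal IDs rather than comparing adjacent pairs, which is my own
simplification so the sketch does not depend on any particular tie-break
order among duplicates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a MIDX entry. */
struct entry {
	unsigned oid; /* stand-in for an object ID */
	int pack_id;  /* which pack this copy lives in */
	int disjoint; /* nonzero if that pack is marked disjoint */
};

/*
 * `e` must be sorted by oid. Returns the index of the first entry that
 * is a second disjoint copy of the same object, or -1 if the disjoint
 * packs really are disjoint.
 */
static long find_disjoint_conflict(const struct entry *e, size_t nr)
{
	size_t i;
	int seen_disjoint = 0; /* has this oid already come from a disjoint pack? */

	for (i = 0; i < nr; i++) {
		assert(!i || e[i - 1].oid <= e[i].oid);
		if (i && e[i - 1].oid != e[i].oid)
			seen_disjoint = 0; /* a new run of equal oids begins */
		if (e[i].disjoint) {
			if (seen_disjoint)
				return (long)i;
			seen_disjoint = 1;
		}
	}
	return -1;
}
```

A duplicate is only a problem when both copies come from packs marked
disjoint; a copy in an unmarked pack is fine.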

Thanks,
Taylor

* Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode
  2023-12-08 22:48         ` Taylor Blau
@ 2023-12-11  8:18           ` Patrick Steinhardt
  2023-12-11 19:59             ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Patrick Steinhardt @ 2023-12-11  8:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On Fri, Dec 08, 2023 at 05:48:28PM -0500, Taylor Blau wrote:
> On Fri, Dec 08, 2023 at 09:19:25AM +0100, Patrick Steinhardt wrote:
> > > > One thing I wondered: do we need to consider the `-l` flag? When using
> > > > an alternate object directory it is totally feasible that the alternate
> > > > may be creating new disjoint packages without us knowing, and thus we
> > > > may not be able to guarantee the disjoint property anymore.
> > >
> > > I don't think so. We'd only care about one direction of this (that
> > > alternates do not create disjoint packs which overlap with ours, instead
> > > of the other way around), but since we don't put non-local packs in the
> > > MIDX, I think we're OK.
> > >
> > > I suppose that you might run into trouble if you use the chained MIDX
> > > thing (via its `->next` pointer). I haven't used that feature myself, so
> > > I'd have to play around with it.
> >
> > We do use this feature at GitLab for forks, where forks connect to a
> > common alternate object directory to deduplicate objects. As both the
> > fork repository and the alternate object directory use an MIDX I think
> > they would be set up exactly like that.
> 
> Yep, that's right. I wasn't sure whether or not this feature had been
> used extensively in production or not (we don't use it at GitHub, since
> objects only live in their fork repositories for a short while before
> moving to the fork network repository).
> 
> > I guess the only really viable solution here is to ignore disjoint packs
> > in the main repo that connects to the alternate in the case where the
> > alternate has any disjoint packs itself.
> 
> I think the behavior you'd get here is that we'd only look for disjoint
> packs in the first MIDX in the chain (which is always the local one),
> and we'd only recognize packs from that MIDX as being potentially
> disjoint.
> 
> If you have the bulk of your repositories in the alternate, then I think
> you might want to consider how we combine the two. 

Yes, in the general case the bulk of objects is indeed contained in the
alternate.

> My sense is that you'd want to be disjoint with respect to anything
> downstream of you.

Ideally yes, but this is unfortunately not easily achievable in the
general case. It's one of the many pain points that alternates bring with
them.

Suppose two forks A and B are connected to the same alternate. Both A
and B now gain the same set of objects via whatever means. At this point
these objects can be stored in disjoint packs in each of the repos as
they are not yet deduplicated via the alternate. But if you were to pull
objects from either A or B into the alternate then you cannot ensure
disjointedness at all anymore because you would first have to repack
objects in both A and B.

For two forks this might still seem manageable. But as soon as your fork
network grows larger it's clear that this becomes almost impossible to
do. So ultimately, I don't see an alternative to ignoring disjointedness
bits in either of the two linked-together repos.

> Whether or not this is a feature that you/others need, I definitely
> think we should leave it out of this series, since I am (a) fairly
> certain that this is possible to do, and (b) already feel like this
> series on its own is complicated enough.

I'm perfectly fine if we say that the benefits of your patch series
cannot yet be applied to repositories with alternates. But from my point
of view it's a requirement that this patch series does not silently
break this usecase due to Git starting to generate disjointed packs
where it cannot ensure that the disjointedness property holds.

As I haven't yet read through this whole patch series, the question
boils down to whether the end result is opt-in or opt-out. If it was
opt-out then I could see the above usecase breaking silently. If it was
opt-in then things should be fine and we can address this omission in a
follow up patch series. We at GitLab would definitely be interested in
helping out with this given that it directly affects us and that the
demonstrated savings seem very promising.

Patrick

* Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode
  2023-12-11  8:18           ` Patrick Steinhardt
@ 2023-12-11 19:59             ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-11 19:59 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano

On Mon, Dec 11, 2023 at 09:18:32AM +0100, Patrick Steinhardt wrote:
> > If you have the bulk of your repositories in the alternate, then I think
> > you might want to consider how we combine the two.
>
> Yes, in the general case the bulk of objects is indeed contained in the
> alternate.

I thought about this for a while this morning, and think that while it
is possible, I'm not sure I can think of a compelling use-case where
you'd want to reuse objects from packs across an alternate boundary.

On the "I think it's possible" front:

The challenge is making sure that the set of disjoint packs among each
object store is globally disjoint in one direction along the alternate
path. I think the rule would require you to honor the disjointed-ness of
any packs in alternate(s) you might have when constructing new disjoint
packs.

So if repository fork.git is an alternate of network.git (and both had
long-lived MIDXs), network.git is free to make any set of disjoint packs
it chooses, and fork.git can only create disjoint packs which are
disjoint with respect to (a) the other disjoint packs in fork.git (if
any), and (b) the disjoint packs in network.git (and recursively for any
repositories that network.git is an alternate of in the general case).
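
As a toy model of that rule (purely illustrative: nothing like this
exists in the series, object IDs are plain sorted integers, and `struct
oid_set`, `sets_overlap()` and `may_mark_disjoint()` are names I made
up), a new pack could be marked disjoint only if it shares no object
with any disjoint pack in the repository or up its alternate chain:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one pack: its object IDs, sorted ascending. */
struct oid_set {
	const unsigned *oids;
	size_t nr;
};

/* Merge-walk two sorted sets; returns 1 as soon as a shared ID is found. */
static int sets_overlap(const struct oid_set *a, const struct oid_set *b)
{
	size_t i = 0, j = 0;

	while (i < a->nr && j < b->nr) {
		if (a->oids[i] == b->oids[j])
			return 1;
		if (a->oids[i] < b->oids[j])
			i++;
		else
			j++;
	}
	return 0;
}

/*
 * `existing` holds every pack already marked disjoint, both local ones
 * and those found along the alternate chain. Returns 1 if `candidate`
 * may join them.
 */
static int may_mark_disjoint(const struct oid_set *candidate,
			     const struct oid_set *existing, size_t existing_nr)
{
	size_t k;

	assert(candidate);
	assert(existing || !existing_nr);
	for (k = 0; k < existing_nr; k++)
		if (sets_overlap(candidate, &existing[k]))
			return 0;
	return 1;
}
```

The check only runs in one direction: network.git never has to consult
fork.git, which is what lets it pick its disjoint packs freely.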

Now on the "I can't think of a compelling use-case" front:

I think the only reason you'd want to be able to reuse objects from
MIDXs across the alternates boundary is if you have MIDX bitmaps in both
the repository and its alternate. Indeed, the only time that we kick in
pack-reuse in general is when we have a bitmap, so in order to reuse
objects from both the repo and its alternate, you'd have to have a
bitmap in both repositories.

But having a MIDX bitmap means that any packs in the MIDX for which
you're generating a bitmap have to be closed over object reachability.
So unless the repository and its alternate have totally distinct lines
of history (in which case, I'm not sure you would want to share objects
between the two in the first place), any pack you bitmap in the child
repository fundamentally couldn't be disjoint with respect to its
parent.

This is because if it were to be disjoint, it would have to be repacked
with '-l' (or some equivalent '-l'-like flag that only ignores non-local
packs which are marked as disjoint). But if you exclude those objects
and any one (or more) of them is reachable from some object(s) you
didn't exclude, you wouldn't be able to generate a bitmap in the first
place.

It's very possible that there's something about your setup that I'm not
fully grokking, but I don't think in general this is something that we'd
want to do (even if it is theoretically possible).

> > Whether or not this is a feature that you/others need, I definitely
> > think we should leave it out of this series, since I am (a) fairly
> > certain that this is possible to do, and (b) already feel like this
> > series on its own is complicated enough.
>
> I'm perfectly fine if we say that the benefits of your patch series
> cannot yet be applied to repositories with alternates. But from my point
> of view it's a requirement that this patch series does not silently
> break this usecase due to Git starting to generate disjointed packs
> where it cannot ensure that the disjointedness property holds.

I think one thing you could reasonably do is use *only* the non-local
MIDX bitmaps when doing pack reuse.

Currently we'll use the first MIDX we find, which is guaranteed to be
the local one, if it exists. This was the case before this series, and
this series does not change that behavior. Unless you had a pack bitmap
in the alternated repository (which I think is unlikely, since it would
require a full reachability closure, thus eliminating any de-duplication
benefits you'd otherwise get when using alternates), you'd be fine
before and after this series.

> As I haven't yet read through this whole patch series, the question
> boils down to whether the end result is opt-in or opt-out. If it was
> opt-out then I could see the above usecase breaking silently. If it was
> opt-in then things should be fine and we can address this omission in a
> follow up patch series. We at GitLab would definitely be interested in
> helping out with this given that it directly affects us and that the
> demonstrated savings seem very promising.

The end result is opt-in, you have to change the `pack.allowPackReuse`
configuration from its default value of "true" (or the alternate
spelling taught in this series, "single") to "multi" in order to enable
the new behavior.
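
In other words, the mapping from configuration value to behavior is
roughly the following. This is a simplified sketch and not the parser
from the series: the enum and function names are invented, and it only
handles the literal spellings "true", "false", "single" and "multi",
whereas Git's boolean config parsing accepts more than that.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical reuse modes selected by pack.allowPackReuse. */
enum reuse_mode {
	REUSE_NONE,    /* pack reuse disabled */
	REUSE_SINGLE,  /* existing behavior: reuse from a single pack */
	REUSE_MULTI,   /* new, opt-in behavior: reuse across disjoint packs */
	REUSE_INVALID  /* spelling not handled by this sketch */
};

static enum reuse_mode parse_allow_pack_reuse(const char *value)
{
	assert(value);
	if (!strcmp(value, "multi"))
		return REUSE_MULTI;
	if (!strcmp(value, "single") || !strcmp(value, "true"))
		return REUSE_SINGLE;
	if (!strcmp(value, "false"))
		return REUSE_NONE;
	return REUSE_INVALID;
}
```

Since the default maps to the single-pack mode, nothing changes for a
repository unless it asks for "multi" explicitly.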

Thanks,
Taylor

* Re: [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab
  2023-11-28 19:07 ` [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
  2023-11-30 10:18   ` Patrick Steinhardt
@ 2023-12-12  7:04   ` Jeff King
  2023-12-12 16:48     ` Taylor Blau
  1 sibling, 1 reply; 107+ messages in thread
From: Jeff King @ 2023-12-12  7:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Patrick Steinhardt, Junio C Hamano

On Tue, Nov 28, 2023 at 02:07:59PM -0500, Taylor Blau wrote:

> +static void clear_bb_commit(struct bb_commit *commit)
> +{
> +	free(commit->reverse_edges);
> +	bitmap_free(commit->commit_mask);
> +	bitmap_free(commit->bitmap);
> +}

I think these bitmaps may sometimes be NULL. But double-checking
bitmap_free(), it sensibly is a noop when passed NULL. So this looks
good to me.

-Peff


* Re: [PATCH 03/24] pack-bitmap: plug leak in find_objects()
  2023-11-28 19:08 ` [PATCH 03/24] pack-bitmap: plug leak in find_objects() Taylor Blau
@ 2023-12-12  7:04   ` Jeff King
  0 siblings, 0 replies; 107+ messages in thread
From: Jeff King @ 2023-12-12  7:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Patrick Steinhardt, Junio C Hamano

On Tue, Nov 28, 2023 at 02:08:02PM -0500, Taylor Blau wrote:

> The `find_objects()` function creates an object_list for any tips of the
> reachability query which do not have corresponding bitmaps.
> 
> The object_list is not used outside of `find_objects()`, but we never
> free it with `object_list_free()`, resulting in a leak. Let's plug that
> leak by calling `object_list_free()`, which results in t6113 becoming
> leak-free.

Makes sense.

> @@ -1280,6 +1280,8 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
>  		base = fill_in_bitmap(bitmap_git, revs, base, seen);
>  	}
>  
> +	object_list_free(&not_mapped);
> +
>  	return base;
>  }

There's an extra return earlier in the function, but it triggers only
when not_mapped is NULL. So this covers all cases. Good.

> +++ b/t/t6113-rev-list-bitmap-filters.sh
> [..]
> +TEST_PASSES_SANITIZE_LEAK=true

Yay. :)

-Peff


* Re: [PATCH 05/24] midx: implement `DISP` chunk
  2023-12-09  2:30         ` Taylor Blau
@ 2023-12-12  8:03           ` Jeff King
  2023-12-13 18:28             ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Jeff King @ 2023-12-12  8:03 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, Patrick Steinhardt

On Fri, Dec 08, 2023 at 09:30:07PM -0500, Taylor Blau wrote:

> > I did wonder how expensive it is to recompute and validate the "distinct"
> > information (in other words, is it too expensive for the consumers
> > of an existing midx file to see which packs are distinct on demand
> > before they stream contents out of the underlying packs?), as the
> > way the packs are marked as distinct looked rather error prone (you
> > can very easily list packfiles with overlaps with "+" prefix and the
> > DISP chunk writer does not even notice that you lied to it).  As long
> > as "git fsck" catches when two packs that are marked as distinct share
> > an object, that is OK, but the arrangement did look rather brittle
> > to me.
> 
> It's likely too expensive to do on the reading side for every
> pack-objects operation or MIDX load.

This made me think a bit. Obviously we can check for disjointedness in
O(n log k), where n is the total number of objects and k is the number
of packs, by doing a k-merge of the sorted lists. But that's a
non-starter, because we may be serving a request that is much smaller
than all "n" objects (e.g., any small fetch, but also even clones when
the pack contains a bunch of irrelevant objects).
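To make the k-merge concrete, here is a rough sketch. It is not Git code:
`toy_pack`, `cursor` and `packs_are_disjoint()` are made-up names, and
object ids are modelled as plain integers in sorted arrays. A min-heap of
per-pack cursors pops every object once, so the total cost is O(n log k).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model: a "pack" is just a sorted array of object ids. */
struct toy_pack {
	const uint32_t *oids;	/* ascending, no duplicates within one pack */
	size_t nr;
};

struct cursor {
	const struct toy_pack *pack;
	size_t pos;
};

static uint32_t cursor_oid(const struct cursor *c)
{
	return c->pack->oids[c->pos];
}

/* Restore the min-heap property below index i. */
static void sift_down(struct cursor *heap, size_t nr, size_t i)
{
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, min = i;
		struct cursor tmp;

		if (l < nr && cursor_oid(&heap[l]) < cursor_oid(&heap[min]))
			min = l;
		if (r < nr && cursor_oid(&heap[r]) < cursor_oid(&heap[min]))
			min = r;
		if (min == i)
			return;
		tmp = heap[i];
		heap[i] = heap[min];
		heap[min] = tmp;
		i = min;
	}
}

/* Return 1 if no object id appears in more than one pack, 0 otherwise. */
static int packs_are_disjoint(const struct toy_pack *packs, size_t k)
{
	struct cursor *heap = malloc((k ? k : 1) * sizeof(*heap));
	size_t nr = 0, i;
	int have_prev = 0, disjoint = 1;
	uint32_t prev = 0;

	assert(heap);
	for (i = 0; i < k; i++) {
		if (!packs[i].nr)
			continue;	/* empty packs contribute nothing */
		heap[nr].pack = &packs[i];
		heap[nr].pos = 0;
		nr++;
	}
	for (i = nr / 2; i-- > 0; )	/* heapify */
		sift_down(heap, nr, i);

	while (nr && disjoint) {
		uint32_t cur = cursor_oid(&heap[0]);

		/* Ids come out in sorted order, so a repeat is a duplicate. */
		if (have_prev && cur == prev)
			disjoint = 0;
		prev = cur;
		have_prev = 1;

		/* Advance the smallest cursor, dropping it when exhausted. */
		if (++heap[0].pos == heap[0].pack->nr)
			heap[0] = heap[--nr];
		sift_down(heap, nr, 0);
	}
	free(heap);
	return disjoint;
}
```

Note that this walks all n objects no matter how small the request is,
which is exactly the cost problem described above.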

But we can relax our condition a bit. The packs only need to be disjoint
with respect to the set of objects that we are going to output (we'll
call that "m"). So at the very least, you can do O(mk) lookups (each one
itself "log n", of course). We know that the work is already going to
be linear in "m". In theory we want to generally keep "k" small, but
part of the point of using midx in this way is to let "k" grow a bit.
So that might turn out pretty bad in practice.
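As a sketch of that relaxed check (again a toy model with invented names,
not anything in the tree): for each of the "m" objects we plan to send, do
one binary search per pack and complain if more than one pack has a copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a "pack" is a sorted array of integer object ids. */
struct toy_pack {
	const uint32_t *oids;	/* sorted ascending */
	size_t nr;
};

/* O(log n) membership test via binary search. */
static int pack_contains(const struct toy_pack *p, uint32_t oid)
{
	size_t lo = 0, hi = p->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (p->oids[mid] == oid)
			return 1;
		if (p->oids[mid] < oid)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/*
 * Return 1 if every object we intend to output lives in at most one
 * pack. This performs m * k lookups, each O(log n).
 */
static int output_is_disjoint(const struct toy_pack *packs, size_t k,
			      const uint32_t *wanted, size_t m)
{
	size_t i, j;

	for (i = 0; i < m; i++) {
		size_t copies = 0;

		for (j = 0; j < k; j++)
			copies += pack_contains(&packs[j], wanted[i]);
		if (copies > 1)
			return 0;
	}
	return 1;
}
```

The packs in the test case overlap on one object, but a request that never
touches that object still passes.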

So let's take one more step back. One thing I didn't feel like I saw
either in this patch or the cover letter is exactly why we care about
disjointedness. IIRC, the main issue is making sure that for any object
X we send verbatim, it is either a non-delta or its delta base is
viable. And the two reasons I can think of for the base to be non-viable
are:

  1. We are not sending the base at all.

  2. The base is in another pack, and we are worried about creating a
     cycle (i.e., in pack A we have X as a delta against Y, and in pack
     B we have Y as a delta against X, and we send both deltas).

We already deal with (1) for the single-pack case by finding the base
object offset, converting it to a pack position, and then making sure
that position is also marked for verbatim reuse.

The interesting one is (2), which is new for midx (since a single pack
cannot have two copies of an object). But I'm not sure if it's possible.
The verbatim reuse code depends on using bitmaps in the first place. And
there is already a preference-order to the packs in the midx bitmaps.

That is, we'll choose all of the objects for pack A over any duplicates
in B, and duplicates from B over C, and so on. If we likewise try
verbatim reuse in that order, then we know that an object in pack A can
never have a base that is selected from pack B or C (because if they do
have duplicates, we'd have selected A's copy to put in the midx bitmap).
And likewise, a copy of an object in pack B will always have its base
either in A or B, but never in C.
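Here is a toy model of that tie-breaking rule, assuming the packs are
already listed in preference order. The names (`toy_pack`,
`midx_pick_pack()`, `base_is_never_later()`) are invented for illustration
and do not exist in Git.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_PACK ((size_t)-1)

/* Hypothetical model: packs are listed in preference order (A, B, C...). */
struct toy_pack {
	const uint32_t *oids;	/* objects stored in this pack */
	size_t nr;
};

static int pack_has(const struct toy_pack *p, uint32_t oid)
{
	size_t i;

	for (i = 0; i < p->nr; i++)
		if (p->oids[i] == oid)
			return 1;
	return 0;
}

/*
 * Resolve duplicates as described above: the copy in the earliest pack
 * wins. Returns the index of the chosen pack, or NO_PACK if absent.
 */
static size_t midx_pick_pack(const struct toy_pack *packs, size_t k,
			     uint32_t oid)
{
	size_t i;

	for (i = 0; i < k; i++)
		if (pack_has(&packs[i], oid))
			return i;
	return NO_PACK;
}

/*
 * A delta stored in pack `delta_pack` has a copy of its base in that
 * same pack, so the selected copy of the base can only come from that
 * pack or an earlier one.
 */
static int base_is_never_later(const struct toy_pack *packs, size_t k,
			       size_t delta_pack, uint32_t base_oid)
{
	assert(delta_pack < k && pack_has(&packs[delta_pack], base_oid));
	return midx_pick_pack(packs, k, base_oid) <= delta_pack;
}
```

The second function is true by construction, which is really the whole
argument: the earliest copy wins, and the delta's own pack always holds a
copy of the base.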

So it kind of seems to me that this would "just work" if
try_partial_reuse() did something like this for the midx case:

  - convert offset of base object into an object id hash using the pack
    revindex (similar to offset_to_pack_pos)

  - look up the object id in the midx to get a pack/offset combo

  - use the midx revindex to convert that into a bit position

  - check if that bit is marked for verbatim reuse

The assumption there is that in the second step, the midx has resolved
duplicates by putting all of pack A before pack B, and so on, as above.
It also assumes that we are trying verbatim reuse in the same order
(though a different order would not produce wrong results, it would only
result in less verbatim reuse).

All of which makes me think I'm missing some other case that is a
problem. While I wait for you to explain it, though, let's continue our
thought experiment for a moment.

If we assume that any cross-pack deltas are a problem, we could always
just skip them for verbatim reuse. That is, once we look up the object
id in the midx, we can see if it's in the same pack we're currently
processing. If not, we could punt and let the non-verbatim code paths
handle it as usual.

That still leaves the problem of realizing we skipped over a chunk of
the packfile (say we are pulling object X from pack B, and it is a delta
of Y, but we already decided to reuse Y from pack A). But the reuse code
already has to accommodate gaps. I think it happens naturally in
write_reused_pack_one(), where we feed the actual byte offsets into
record_reused_object(). You'd have to take some care not to go past gaps
in the blind copy that happens in write_reused_pack_verbatim(). So you
might need to mark the first such gap somehow (if it's hard, I'd suggest
just skipping write_reused_pack_verbatim() entirely; it is a fairly
minor optimization compared to the rest of it).

And of course there's a bunch of hard work teaching all of those
functions about midx's and multiple packs in the first place, but you
already had to do all that in the latter part of your series, and it
would still be applicable.

OK, this is the part where you tell me what I'm missing. ;)

-Peff


* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (24 preceding siblings ...)
  2023-11-30 10:18 ` [PATCH 00/24] pack-objects: multi-pack verbatim reuse Patrick Steinhardt
@ 2023-12-12  8:12 ` Jeff King
  2023-12-15 15:37   ` Taylor Blau
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
  26 siblings, 1 reply; 107+ messages in thread
From: Jeff King @ 2023-12-12  8:12 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Patrick Steinhardt, Junio C Hamano

On Tue, Nov 28, 2023 at 02:07:54PM -0500, Taylor Blau wrote:

> Performing verbatim pack reuse naturally trades off between CPU time and
> the resulting pack size. In the above example, the single-pack reuse
> case produces a clone size of ~194 MB on my machine, while the
> multi-pack reuse case produces a clone size closer to ~266 MB, which is
> a ~37% increase in clone size.

Right, it's definitely a tradeoff. So taking a really big step back,
there are a few optimizations all tied up in the verbatim reuse code:

  1. in some cases we get to dump whole swaths of the on-disk packfile
     to the output, covering many objects with a few memcpy() calls.
     (This is still O(n), of course, but it's fewer instructions per
     object).

  2. any other reused objects have only a small-ish amount of work to
     fix up ofs deltas, handle gaps, and so on. We get to skip adding
     them to the packing_list struct (this saves some CPU, but also a
     lot of memory)

  3. we skip the delta search for these reused objects. This is where
     your big CPU / output size tradeoff comes into play, I'd think.

So my question is: how much of what you're seeing is from (1) and (2),
and how much is from (3)? Because there are other ways to trigger (3),
such as lowering the window size. For example, if you try your same
packing example with --window=0, how do the CPU and output size compare
to the results of your series? (I'd also check peak memory usage).

-Peff


* Re: [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab
  2023-12-12  7:04   ` Jeff King
@ 2023-12-12 16:48     ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-12 16:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Patrick Steinhardt, Junio C Hamano

On Tue, Dec 12, 2023 at 02:04:06AM -0500, Jeff King wrote:
> On Tue, Nov 28, 2023 at 02:07:59PM -0500, Taylor Blau wrote:
>
> > +static void clear_bb_commit(struct bb_commit *commit)
> > +{
> > +	free(commit->reverse_edges);
> > +	bitmap_free(commit->commit_mask);
> > +	bitmap_free(commit->bitmap);
> > +}
>
> I think these bitmaps may sometimes be NULL. But double-checking
> bitmap_free(), it sensibly is a noop when passed NULL. So this looks
> good to me.

Yeah, bitmap_free() handles a NULL input correctly, so we can pass a
possibly-NULL `commit->commit_mask` or `commit->bitmap` argument and be
OK.

Thanks,
Taylor


* Re: [PATCH 05/24] midx: implement `DISP` chunk
  2023-12-12  8:03           ` Jeff King
@ 2023-12-13 18:28             ` Taylor Blau
  2023-12-13 19:20               ` Junio C Hamano
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-12-13 18:28 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Patrick Steinhardt

Hi Peff,

On Tue, Dec 12, 2023 at 03:03:32AM -0500, Jeff King wrote:
> [...]

Well this took me longer to respond to than I thought it would ;-).

> So it kind of seems to me that this would "just work" if
> try_partial_reuse() did something like this for the midx case:
>
>   - convert offset of base object into an object id hash using the pack
>     revindex (similar to offset_to_pack_pos)
>
>   - look up the object id in the midx to get a pack/offset combo
>
>   - use the midx revindex to convert that into a bit position
>
>   - check if that bit is marked for verbatim reuse
>
> The assumption there is that in the second step, the midx has resolved
> duplicates by putting all of pack A before pack B, and so on, as above.
> It also assumes that we are trying verbatim reuse in the same order
> (though a different order would not produce wrong results, it would only
> result in less verbatim reuse).

After much thinking, I agree with your conclusion here. Which is great
news indeed, since it allows us to implement multi-pack reuse without
requiring that the candidate packs be disjoint with respect to their
objects.

Even though we have some protections in place for ensuring these packs'
disjointed-ness, I agree with Junio upthread that this is likely the
most fragile part of this series. That is, even though we check in
midx.c::get_sorted_entries() that the marked packs are disjoint, if we
miss something, the results would be rather bad. At best it would result
in sending a corrupt packfile to the client, and at worst it could
result in permanent data corruption if repacking a repository with
multi-pack reuse enabled.

> If we assume that any cross-pack deltas are a problem, we could always
> just skip them for verbatim reuse. That is, once we look up the object
> id in the midx, we can see if it's in the same pack we're currently
> processing. If not, we could punt and let the non-verbatim code paths
> handle it as usual.

Let me think aloud for a second, since it took me quite a lot of
thinking to fully wrap my head around this. Suppose we have two packs,
P1 and P2 where P1 has object A delta'd against B, and P2 has its own
copy of object B. If the MIDX chose the copy of B from P2, but also
decided to send A (which it chose from P1), then if A is stored as an
OFS delta, we would be sending a corrupt packfile to the client.

I think there are a few things that we could reasonably do here:

  - Reject cross-pack deltas entirely (as you suggest). This is the
    easiest implementation choice in this already-complex series, and it
    doesn't paint us into a corner in the sense that we could relax
    these requirements in the future.

  - Turn any cross-pack deltas (which are stored as OFS_DELTAs) into
    REF_DELTAs. We already do this today when reusing an OFS_DELTA
    without `--delta-base-offset` enabled, so it's not a huge stretch to
    do the same for cross-pack deltas even when `--delta-base-offset` is
    enabled.

    This would work, but would obviously result in larger-than-necessary
    packs, as we in theory *could* represent these cross-pack deltas by
    patching an existing OFS_DELTA. But it's not clear how much that
    would matter in practice. I suspect it would have a lot to do with
    how you pack your repository in the first place.

  - Finally, we could patch OFS_DELTAs across packs in a similar fashion
    as we do today for OFS_DELTAs within a single pack on either side of
    a gap. This would result in the smallest packs of the three options
    here, but implementing this would be more involved.

    At minimum, you'd have to keep the reusable chunks list for all
    reused packs, not just the one we're currently processing. And you'd
    have to ensure that any bases which are a part of cross-pack deltas
    appear before the delta. I think this is possible to do, but would
    require assembling the reusable chunks list potentially in a
    different order than they appear in the source packs.

> And of course there's a bunch of hard work teaching all of those
> functions about midx's and multiple packs in the first place, but you
> already had to do all that in the latter part of your series, and it
> would still be applicable.

Yep, all of that is still a requirement here, and lives on in the
revised version of this series.

The naive implementation where we call try_partial_reuse() on every
object which is a candidate for reuse and check for cross-pack deltas
would work, but have poor performance. The reason is that we would be
doing a significant amount of cache-inefficient work to determine
whether or not the base for some delta/base-pair resides in the same
pack:

  - If you see a delta in some pack while processing a MIDX bitmap for
    reuse, you first have to figure out the location of its base in that
    same pack. (Note: this may or may not be the copy of the base object
    chosen by the MIDX).

  - To figure out whether or not the MIDX chose that copy, you would
    naively have to do something like:

      - Convert the base object's offset into a packfile position using
        the pack revindex.

      - Convert the base object's packfile position into an index
        position, which would then be used to obtain its OID.

      - Then binary search through the MIDX for that OID found in the
        previous step, filling out the MIDX entry for that object.

      - Toss out the cross-pack delta/base pair if the MIDX entry in the
        previous step indicates that the MIDX chose a copy of the base
        from a different pack than the one we're currently processing
        (i.e. where the delta resides).

That's rather inefficient. But, we can do better. Instead of going back
and forth through both the pack and MIDX's reverse index, we can simply
binary search in the MIDX's reverse index for the (pack_id, offset) pair
corresponding to the base.

If we find a match, then we know that the MIDX chose its copy of the
base object from the same pack, and we can reuse the delta/base-pair. If
we don't, then we know that the MIDX chose its copy of the base object
from a different pack, and we have to throw out the delta/base-pair.

This is a bit more involved than the naive implementation, but it's (a)
efficient, and (b) most of the code for it already exists in the form of
midx_to_pack_pos(), which implements a binary search over the MIDX
bitmap's pseudo-pack order.

With some light refactoring, we can repurpose this code to perform a
binary search for a given (pack, offset) pair instead of starting with a
MIDX lex position and converting it into the (pack, offset) pair. So
that works, and is what I've done in the revised version of this series.
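To illustrate the idea (this is only a sketch, not the code from the
series), assume the MIDX reverse index is an array holding just the copies
the MIDX selected, sorted by (pack_id, offset). `toy_midx_entry` and
`toy_pair_to_pos()` are hypothetical names, and the model ignores the
special placement of the preferred pack by assuming pack ids are already
numbered in pseudo-pack order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of the MIDX reverse index. */
struct toy_midx_entry {
	uint32_t pack_id;
	uint64_t offset;
};

/* Order pairs by pack first, then by offset within the pack. */
static int pair_cmp(uint32_t pack_a, uint64_t off_a,
		    uint32_t pack_b, uint64_t off_b)
{
	if (pack_a != pack_b)
		return pack_a < pack_b ? -1 : 1;
	if (off_a != off_b)
		return off_a < off_b ? -1 : 1;
	return 0;
}

/*
 * Binary search for (pack_id, offset). Returns 0 and sets *pos on a
 * hit, meaning the MIDX chose the copy at that location. Returns -1 on
 * a miss, meaning the chosen copy of that object lives in another pack.
 */
static int toy_pair_to_pos(const struct toy_midx_entry *entries, size_t nr,
			   uint32_t pack_id, uint64_t offset, size_t *pos)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = pair_cmp(pack_id, offset,
				   entries[mid].pack_id, entries[mid].offset);

		if (!cmp) {
			*pos = mid;
			return 0;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1;
}
```

A miss is the signal to throw out the delta/base-pair; a hit yields the
bit position directly, without a round trip through object ids.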

There is one other snag, which is that we can no longer blindly reuse
whole-words from the reuse bitmap if we have non-disjoint packs. That
is, we can't do something like:

    while (pos < result->word_alloc && result->words[pos] == (eword_t)~0)
        pos++;

when processing anything but the first pack.

The reason is that we know the first pack has all duplicate object ties
broken in its favor, but we don't have the same guarantee for subsequent
packs. So we have to be more careful about which bits we reuse from
those subsequent packs, since we may inadvertently pick up a cross-pack
delta/base pair without inspecting it more closely.

As I mentioned, you can still perform this optimization over the first
pack, and I think that will be sufficient for most repositories. It's
not clear to me exactly how much this optimization is helping us in
contrast to all of the other work that pack-objects is doing, but that
is probably something worth measuring.
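Roughly, the distinction looks like the sketch below. `toy_word_t` stands
in for eword_t, and both function names are invented for this example;
the real code differs in its details.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_word_t;	/* stand-in for eword_t */
#define TOY_BITS_IN_WORD 64

/* Count the leading words whose bits are all set. */
static size_t count_full_words(const toy_word_t *words, size_t nr_words)
{
	size_t pos = 0;

	while (pos < nr_words && words[pos] == (toy_word_t)~0)
		pos++;
	return pos;
}

/*
 * Number of leading bits that may be reused without inspecting each
 * object. Only the first pack has all duplicate ties broken in its
 * favor, so later packs get no blind reuse and must go bit by bit.
 */
static size_t blind_reuse_bits(const toy_word_t *words, size_t nr_words,
			       int is_first_pack)
{
	if (!is_first_pack)
		return 0;
	return count_full_words(words, nr_words) * TOY_BITS_IN_WORD;
}
```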

Thanks for the terrific suggestion. I'll clean up the results of trying
to implement it, and share it with the list soon (ideally before the end
of this week, after which I'm on vacation until the new year).

Thanks,
Taylor


* Re: [PATCH 05/24] midx: implement `DISP` chunk
  2023-12-13 18:28             ` Taylor Blau
@ 2023-12-13 19:20               ` Junio C Hamano
  0 siblings, 0 replies; 107+ messages in thread
From: Junio C Hamano @ 2023-12-13 19:20 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, git, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

>   - Turn any cross-pack deltas (which are stored as OFS_DELTAs) into
>     REF_DELTAs. We already do this today when reusing an OFS_DELTA
>     without `--delta-base-offset` enabled, so it's not a huge stretch to
>     do the same for cross-pack deltas even when `--delta-base-offset` is
>     enabled.
>
>     This would work, but would obviously result in larger-than-necessary
>     packs, as we in theory *could* represent these cross-pack deltas by
>     patching an existing OFS_DELTA. But it's not clear how much that
>     would matter in practice. I suspect it would have a lot to do with
>     how you pack your repository in the first place.

I have to wonder if the cost of additional computation to see when
it is safe to allow cross-pack delta can be kept reasonably low, as
you'd need to prove that you are not introducing a cycle by doing
so.  Compared to that, spending a dozen or so bytes for the offset
for rare cases of having to express such a cross-pack delta pair
would be much less worrisome to me.
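For a sense of scale, here is a standalone sketch of the variable-length
offset encoding used by OFS_DELTA entries, next to the 20 bytes a
REF_DELTA spends naming its base with a SHA-1 object id. The function
names and the TOY_SHA1_RAWSZ constant are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SHA1_RAWSZ 20	/* bytes a REF_DELTA uses for its base id */

/*
 * Encode a relative offset in the OFS_DELTA style: 7 bits per byte,
 * most significant group first, high bit set on all but the last byte,
 * with one subtracted from each continuation group. Writes to `out`
 * (at least 10 bytes) and returns the encoded length.
 */
static size_t encode_ofs(uint64_t ofs, unsigned char *out)
{
	unsigned char tmp[10];
	size_t pos = sizeof(tmp) - 1, len, i;

	tmp[pos] = ofs & 127;
	while (ofs >>= 7)
		tmp[--pos] = 128 | (--ofs & 127);
	len = sizeof(tmp) - pos;
	for (i = 0; i < len; i++)
		out[i] = tmp[pos + i];
	return len;
}

/* Inverse of encode_ofs(); *used receives the number of bytes read. */
static uint64_t decode_ofs(const unsigned char *in, size_t *used)
{
	size_t i = 0;
	unsigned char c = in[i++];
	uint64_t ofs = c & 127;

	while (c & 128) {
		ofs += 1;
		c = in[i++];
		ofs = (ofs << 7) + (c & 127);
	}
	*used = i;
	return ofs;
}
```

An offset of a megabyte fits in three bytes this way, so the per-object
difference between the two representations is on the order of the hash
size.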

> Thanks for the terrific suggestion. I'll clean up the results of trying
> to implement it, and share it with the list soon (ideally before the end
> of this week, after which I'm on vacation until the new year).

Good to hear that you'll be recharging soon.  Enjoy.


* [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
  2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
                   ` (25 preceding siblings ...)
  2023-12-12  8:12 ` Jeff King
@ 2023-12-14 22:23 ` Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 01/26] pack-objects: free packing_data in more places Taylor Blau
                     ` (26 more replies)
  26 siblings, 27 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

This is a reroll of my series to implement multi-pack verbatim reuse.

Since last time, I implemented a suggestion from Peff in [1] that
requires reusable packs be disjoint with respect to the objects they are
sending, not their entire set of objects. This allows for us to
implement multi-pack reuse without the DISP chunk or the
`--extend-disjoint`, `--exclude-disjoint`, `--retain-disjoint`, etc
options from the previous round.

Besides that, much is unchanged from last time, and the bulk of the
remaining changes are from Patrick Steinhardt's review of the first 2/3
of this series.

Performance remains mostly unchanged since last time, and I was able to
achieve the following results in hyperfine on my worst-case scenario
test repository (broken into ~100 packs):

    $ hyperfine -L v single,multi -n '{v}-pack reuse' \
      'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
    Benchmark 1: single-pack reuse
      Time (mean ± σ):      6.234 s ±  0.026 s    [User: 43.733 s, System: 0.349 s]
      Range (min … max):    6.197 s …  6.293 s    10 runs

    Benchmark 2: multi-pack reuse
      Time (mean ± σ):     959.5 ms ±   2.4 ms    [User: 1133.8 ms, System: 36.3 ms]
      Range (min … max):   957.2 ms … 964.8 ms    10 runs

    Summary
      multi-pack reuse ran
        6.50 ± 0.03 times faster than single-pack reuse

As before, this series is still quite large, despite losing a fair bit
of code related to disjoint packs. After thinking on it some more, I
still couldn't come up with a satisfying way to break up this series, so
here it is presented in one big chunk. If others have ideas on how
better to present this series, please let me know.

In the meantime, thanks in advance for your review!

[1]: https://lore.kernel.org/git/20231212080332.GC1117953@coredump.intra.peff.net/

Taylor Blau (26):
  pack-objects: free packing_data in more places
  pack-bitmap-write: deep-clear the `bb_commit` slab
  pack-bitmap: plug leak in find_objects()
  midx: factor out `fill_pack_info()`
  midx: implement `BTMP` chunk
  midx: implement `midx_locate_pack()`
  pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
  ewah: implement `bitmap_is_empty()`
  pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
  pack-bitmap: return multiple packs via
    `reuse_partial_packfile_from_bitmap()`
  pack-objects: parameterize pack-reuse routines over a single pack
  pack-objects: keep track of `pack_start` for each reuse pack
  pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
  pack-objects: prepare `write_reused_pack()` for multi-pack reuse
  pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
    reuse
  pack-objects: include number of packs reused in output
  git-compat-util.h: implement checked size_t to uint32_t conversion
  midx: implement `midx_preferred_pack()`
  pack-revindex: factor out `midx_key_to_pack_pos()` helper
  pack-revindex: implement `midx_pair_to_pack_pos()`
  pack-bitmap: prepare to mark objects from multiple packs for reuse
  pack-objects: add tracing for various packfile metrics
  t/test-lib-functions.sh: implement `test_trace2_data` helper
  pack-objects: allow setting `pack.allowPackReuse` to "single"
  pack-bitmap: enable reuse from all bitmapped packs
  t/perf: add performance tests for multi-pack reuse

 Documentation/config/pack.txt      |  16 +-
 Documentation/gitformat-pack.txt   |  76 +++++++
 builtin/pack-objects.c             | 169 +++++++++++----
 ewah/bitmap.c                      |   9 +
 ewah/ewok.h                        |   1 +
 git-compat-util.h                  |   9 +
 midx.c                             | 151 +++++++++++---
 midx.h                             |  12 +-
 pack-bitmap-write.c                |   9 +-
 pack-bitmap.c                      | 321 +++++++++++++++++++----------
 pack-bitmap.h                      |  19 +-
 pack-objects.c                     |  15 ++
 pack-objects.h                     |   1 +
 pack-revindex.c                    |  50 +++--
 pack-revindex.h                    |   3 +
 t/helper/test-read-midx.c          |  41 +++-
 t/perf/p5332-multi-pack-reuse.sh   |  81 ++++++++
 t/t5319-multi-pack-index.sh        |  35 ++++
 t/t5332-multi-pack-reuse.sh        | 203 ++++++++++++++++++
 t/t6113-rev-list-bitmap-filters.sh |   2 +
 t/test-lib-functions.sh            |  14 ++
 21 files changed, 1039 insertions(+), 198 deletions(-)
 create mode 100755 t/perf/p5332-multi-pack-reuse.sh
 create mode 100755 t/t5332-multi-pack-reuse.sh

Range-diff against v1:
 1:  101d6a2841 !  1:  7d65abfa1d pack-objects: free packing_data in more places
    @@ Commit message
     
         Since these structures contain allocated fields, failing to
         appropriately free() them results in a leak. Plug that leak by
    -    introducing a free_packing_data() function, and call it in the
    +    introducing a clear_packing_data() function, and call it in the
         appropriate spots.
     
         This is a fairly straightforward leak to plug, since none of the callers
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
      			   reuse_packfile_objects);
      
      cleanup:
    -+	free_packing_data(&to_pack);
    ++	clear_packing_data(&to_pack);
      	list_objects_filter_release(&filter_options);
      	strvec_clear(&rp);
      
    @@ midx.c: static int write_midx_internal(const char *object_dir,
      				      flags) < 0) {
      			error(_("could not write multi-pack bitmap"));
      			result = 1;
    -+			free_packing_data(&pdata);
    ++			clear_packing_data(&pdata);
     +			free(commits);
      			goto cleanup;
      		}
     +
    -+		free_packing_data(&pdata);
    ++		clear_packing_data(&pdata);
     +		free(commits);
      	}
      	/*
    @@ pack-objects.c: void prepare_packing_data(struct repository *r, struct packing_d
      	init_recursive_mutex(&pdata->odb_lock);
      }
      
    -+void free_packing_data(struct packing_data *pdata)
    ++void clear_packing_data(struct packing_data *pdata)
     +{
     +	if (!pdata)
     +		return;
    @@ pack-objects.h: struct packing_data {
      };
      
      void prepare_packing_data(struct repository *r, struct packing_data *pdata);
    -+void free_packing_data(struct packing_data *pdata);
    ++void clear_packing_data(struct packing_data *pdata);
      
      /* Protect access to object database */
      static inline void packing_data_lock(struct packing_data *pdata)
 2:  6f5ff96998 !  2:  19cdaf59c5 pack-bitmap-write: deep-clear the `bb_commit` slab
    @@ Commit message
         bb_commit` type, and make sure it is called on each member of the slab
         via the `deep_clear_bb_data()` function.
     
    +    Note that it is possible for both of the arguments to `bitmap_free()` to
    +    be NULL, but `bitmap_free()` is a noop for NULL arguments, so it is OK
    +    to pass them unconditionally.
    +
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## pack-bitmap-write.c ##
    @@ pack-bitmap-write.c: struct bb_commit {
      
     +static void clear_bb_commit(struct bb_commit *commit)
     +{
    -+	free(commit->reverse_edges);
    ++	free_commit_list(commit->reverse_edges);
     +	bitmap_free(commit->commit_mask);
     +	bitmap_free(commit->bitmap);
     +}
 3:  bc38fba655 =  3:  477df6c974 pack-bitmap: plug leak in find_objects()
 4:  ccf1337305 !  4:  a06ed75af1 midx: factor out `fill_pack_info()`
    @@ midx.c: struct pack_info {
      };
      
     +static void fill_pack_info(struct pack_info *info,
    -+			   struct packed_git *p, char *pack_name,
    ++			   struct packed_git *p, const char *pack_name,
     +			   uint32_t orig_pack_int_id)
     +{
     +	memset(info, 0, sizeof(struct pack_info));
     +
     +	info->orig_pack_int_id = orig_pack_int_id;
    -+	info->pack_name = pack_name;
    ++	info->pack_name = xstrdup(pack_name);
     +	info->p = p;
     +}
     +
    @@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len
     +		if (open_pack_index(p)) {
      			warning(_("failed to open pack-index '%s'"),
      				full_path);
    - 			close_pack(ctx->info[ctx->nr].p);
    -@@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
    +-			close_pack(ctx->info[ctx->nr].p);
    +-			FREE_AND_NULL(ctx->info[ctx->nr].p);
    ++			close_pack(p);
    ++			free(p);
      			return;
      		}
      
     -		ctx->info[ctx->nr].pack_name = xstrdup(file_name);
     -		ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
     -		ctx->info[ctx->nr].expired = 0;
    -+		fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
    -+			       ctx->nr);
    ++		fill_pack_info(&ctx->info[ctx->nr], p, file_name, ctx->nr);
      		ctx->nr++;
      	}
      }
    @@ midx.c: static int write_midx_internal(const char *object_dir,
      
     -			ctx.nr++;
     +			fill_pack_info(&ctx.info[ctx.nr++], ctx.m->packs[i],
    -+				       xstrdup(ctx.m->pack_names[i]), i);
    ++				       ctx.m->pack_names[i], i);
      		}
      	}
      
 5:  c52d7e7b27 !  5:  6fdc68418f midx: implement `DISP` chunk
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    midx: implement `DISP` chunk
    +    midx: implement `BTMP` chunk
     
         When a multi-pack bitmap is used to implement verbatim pack reuse (that
         is, when verbatim chunks from an on-disk packfile are copied
    @@ Commit message
         It would be beneficial to be able to perform this same optimization over
         multiple packs, provided some modest constraints (most importantly, that
         the set of packs eligible for verbatim reuse are disjoint with respect
    -    to the objects that they contain).
    +    to the subset of their objects being sent).
     
         If we assume that the packs which we treat as candidates for verbatim
    -    reuse are disjoint with respect to their objects, we need to make only
    -    modest modifications to the verbatim pack-reuse code itself. Most
    -    notably, we need to remove the assumption that the bits in the
    -    reachability bitmap corresponding to objects from the single reuse pack
    -    begin at the first bit position.
    +    reuse are disjoint with respect to any of their objects we may output,
    +    we need to make only modest modifications to the verbatim pack-reuse
    +    code itself. Most notably, we need to remove the assumption that the
    +    bits in the reachability bitmap corresponding to objects from the single
    +    reuse pack begin at the first bit position.
     
         Future patches will unwind these assumptions and reimplement their
         existing functionality as special cases of the more general assumptions
    @@ Commit message
         to start at 0 for all existing cases).
     
         This patch does not yet relax any of those assumptions. Instead, it
    -    implements a foundational data-structure, the "Disjoint Packs" (`DISP`)
    -    chunk of the multi-pack index. The `DISP` chunk's contents are described
    -    in detail here. Importantly, the `DISP` chunk contains information to
    +    implements a foundational data-structure, the "Bitmapped Packs" (`BTMP`)
    +    chunk of the multi-pack index. The `BTMP` chunk's contents are described
    +    in detail here. Importantly, the `BTMP` chunk contains information to
         map regions of a multi-pack index's reachability bitmap to the packs
         whose objects they represent.
     
    @@ Commit message
         used in this patch to test the new chunk's behavior). Future patches
         will begin to make use of this new chunk.
     
    -    This patch implements reading (though no callers outside of the above
    -    one perform any reading) and writing this new chunk. It also extends the
    -    `--stdin-packs` format used by the `git multi-pack-index write` builtin
    -    to be able to designate that a given pack is to be marked as "disjoint"
    -    by prefixing it with a '+' character.
    -
         [^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
           pack that wasn't used verbatim.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    - ## Documentation/git-multi-pack-index.txt ##
    -@@ Documentation/git-multi-pack-index.txt: write::
    - 	--stdin-packs::
    - 		Write a multi-pack index containing only the set of
    - 		line-delimited pack index basenames provided over stdin.
    -+		Lines beginning with a '+' character (followed by the
    -+		pack index basename as before) have their pack marked as
    -+		"disjoint". See the "`DISP` chunk and disjoint packs"
    -+		section in linkgit:gitformat-pack[5] for more.
    - 
    - 	--refs-snapshot=<path>::
    - 		With `--bitmap`, optionally specify a file which
    -
      ## Documentation/gitformat-pack.txt ##
     @@ Documentation/gitformat-pack.txt: CHUNK DATA:
      	    is padded at the end with between 0 and 3 NUL bytes to make the
      	    chunk size a multiple of 4 bytes.
      
    -+	Disjoint Packfiles (ID: {'D', 'I', 'S', 'P'})
    -+	    Stores a table of three 4-byte unsigned integers in network order.
    ++	Bitmapped Packfiles (ID: {'B', 'T', 'M', 'P'})
    ++	    Stores a table of two 4-byte unsigned integers in network order.
     +	    Each table entry corresponds to a single pack (in the order that
     +	    they appear above in the `PNAM` chunk). The values for each table
     +	    entry are as follows:
    -+	    - The first bit position (in psuedo-pack order, see below) to
    ++	    - The first bit position (in pseudo-pack order, see below) to
     +	      contain an object from that pack.
     +	    - The number of bits whose objects are selected from that pack.
    -+	    - A "meta" value, whose least-significant bit indicates whether or
    -+	      not the pack is disjoint with respect to other packs. The
    -+	      remaining bits are unused.
    -+	    Two packs are "disjoint" with respect to one another when they have
    -+	    disjoint sets of objects. In other words, any object found in a pack
    -+	    contained in the set of disjoint packfiles is guaranteed to be
    -+	    uniquely located among those packs.
     +
      	OID Fanout (ID: {'O', 'I', 'D', 'F'})
      	    The ith entry, F[i], stores the number of OIDs with first
    @@ Documentation/gitformat-pack.txt: packs arranged in MIDX order (with the preferr
      The MIDX's reverse index is stored in the optional 'RIDX' chunk within
      the MIDX itself.
      
    -+=== `DISP` chunk and disjoint packs
    ++=== `BTMP` chunk
     +
    -+The Disjoint Packfiles (`DISP`) chunk encodes additional information
    ++The Bitmapped Packfiles (`BTMP`) chunk encodes additional information
     +about the objects in the multi-pack index's reachability bitmap. Recall
    -+that objects from the MIDX are arranged in "pseudo-pack" order (see:
    ++that objects from the MIDX are arranged in "pseudo-pack" order (see
     +above) for reachability bitmaps.
     +
     +From the example above, suppose we have packs "a", "b", and "c", with
    @@ Documentation/gitformat-pack.txt: packs arranged in MIDX order (with the preferr
     +    |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
     +
     +When working with single-pack bitmaps (or, equivalently, multi-pack
    -+reachability bitmaps without any packs marked as disjoint),
    -+linkgit:git-pack-objects[1] performs ``verbatim'' reuse, attempting to
    -+reuse chunks of the existing packfile instead of adding objects to the
    -+packing list.
    ++reachability bitmaps with a preferred pack), linkgit:git-pack-objects[1]
    ++performs ``verbatim'' reuse, attempting to reuse chunks of the bitmapped
    ++or preferred packfile instead of adding objects to the packing list.
     +
    -+When a chunk of bytes are reused from an existing pack, any objects
    ++When a chunk of bytes is reused from an existing pack, any objects
     +contained therein do not need to be added to the packing list, saving
     +memory and CPU time. But a chunk from an existing packfile can only be
     +reused when the following conditions are met:
    @@ Documentation/gitformat-pack.txt: packs arranged in MIDX order (with the preferr
     +    (i.e. does not contain any objects which the caller didn't ask for
     +    explicitly or implicitly).
     +
    -+  - All objects stored as offset- or reference-deltas also include their
    -+    base object in the resulting pack.
    ++  - All objects stored in non-thin packs as offset- or reference-deltas
    ++    also include their base object in the resulting pack.
     +
    -+Additionally, packfiles many not contain more than one copy of any given
    -+object. This introduces an additional constraint over the set of packs
    -+we may want to reuse. The most straightforward approach is to mandate
    -+that the set of packs is disjoint with respect to the set of objects
    -+contained in each pack. In other words, for each object `o` in the union
    -+of all objects stored by the disjoint set of packs, `o` is contained in
    -+exactly one pack from the disjoint set.
    -+
    -+One alternative design choice for multi-pack reuse might instead involve
    -+imposing a chunk-level constraint that allows packs in the reusable set
    -+to contain multiple copies across different packs, but restricts each
    -+chunk against including more than one copy of such an object. This is in
    -+theory possible to implement, but significantly more complicated than
    -+forcing packs themselves to be disjoint. Most notably, we would have to
    -+keep track of which objects have already been sent during verbatim
    -+pack-reuse, defeating the main purpose of verbatim pack reuse (that we
    -+don't have to keep track of individual objects).
    -+
    -+The `DISP` chunk encodes the necessary information in order to implement
    -+multi-pack reuse over a disjoint set of packs as described above.
    -+Specifically, the `DISP` chunk encodes three pieces of information (all
    ++The `BTMP` chunk encodes the necessary information in order to implement
    ++multi-pack reuse over a set of packfiles as described above.
    ++Specifically, the `BTMP` chunk encodes two pieces of information (all
     +32-bit unsigned integers in network byte-order) for each packfile `p`
     +that is stored in the MIDX, as follows:
     +
    @@ Documentation/gitformat-pack.txt: packs arranged in MIDX order (with the preferr
     +`bitmap_nr`:: The number of bit positions (including the one at
     +  `bitmap_pos`) that encode objects from that pack `p`.
     +
    -+`disjoint`:: Metadata, including whether or not the pack `p` is
    -+  ``disjoint''. The least significant bit stores whether or not the pack
    -+  is disjoint. The remaining bits are reserved for future use.
    -+
    -+For example, the `DISP` chunk corresponding to the above example (with
    ++For example, the `BTMP` chunk corresponding to the above example (with
     +packs ``a'', ``b'', and ``c'') would look like:
     +
    -+[cols="1,2,2,2"]
    ++[cols="1,2,2"]
     +|===
    -+| |`bitmap_pos` |`bitmap_nr` |`disjoint`
    ++| |`bitmap_pos` |`bitmap_nr`
     +
     +|packfile ``a''
     +|`0`
     +|`10`
    -+|`0x1`
     +
     +|packfile ``b''
     +|`10`
     +|`15`
    -+|`0x1`
     +
     +|packfile ``c''
     +|`25`
     +|`20`
    -+|`0x1`
     +|===
     +
    -+With these constraints and information in place, we can treat each
    -+packfile marked as disjoint as individually reusable in the same fashion
    -+as verbatim pack reuse is performed on individual packs prior to the
    -+implementation of the `DISP` chunk.
    ++With this information in place, we can treat each packfile as
    ++individually reusable in the same fashion as verbatim pack reuse is
    ++performed on individual packs prior to the implementation of the `BTMP`
    ++chunk.
     +
      == cruft packs
      
      The cruft packs feature offer an alternative to Git's traditional mechanism of
     
    - ## builtin/multi-pack-index.c ##
    -@@ builtin/multi-pack-index.c: static int git_multi_pack_index_write_config(const char *var, const char *value,
    - 	return 0;
    - }
    - 
    -+#define DISJOINT ((void*)(uintptr_t)1)
    -+
    - static void read_packs_from_stdin(struct string_list *to)
    - {
    - 	struct strbuf buf = STRBUF_INIT;
    --	while (strbuf_getline(&buf, stdin) != EOF)
    --		string_list_append(to, buf.buf);
    -+	while (strbuf_getline(&buf, stdin) != EOF) {
    -+		if (*buf.buf == '+')
    -+			string_list_append(to, buf.buf + 1)->util = DISJOINT;
    -+		else
    -+			string_list_append(to, buf.buf);
    -+	}
    - 	string_list_sort(to);
    - 
    - 	strbuf_release(&buf);
    -
      ## midx.c ##
     @@
      
      #define MIDX_CHUNK_ALIGNMENT 4
      #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
    -+#define MIDX_CHUNKID_DISJOINTPACKS 0x44495350 /* "DISP" */
    ++#define MIDX_CHUNKID_BITMAPPEDPACKS 0x42544d50 /* "BTMP" */
      #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
      #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
      #define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
    +@@
    + #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
    + #define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
    + #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
    ++#define MIDX_CHUNK_BITMAPPED_PACKS_WIDTH (2 * sizeof(uint32_t))
    + #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
    + 
    + #define PACK_EXPIRED UINT_MAX
     @@ midx.c: struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
      
      	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
      		   &m->chunk_large_offsets_len);
    -+	pair_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
    -+		   (const unsigned char **)&m->chunk_disjoint_packs,
    -+		   &m->chunk_disjoint_packs_len);
    ++	pair_chunk(cf, MIDX_CHUNKID_BITMAPPEDPACKS,
    ++		   (const unsigned char **)&m->chunk_bitmapped_packs,
    ++		   &m->chunk_bitmapped_packs_len);
      
      	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
      		pair_chunk(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex,
    @@ midx.c: int prepare_midx_pack(struct repository *r, struct multi_pack_index *m,
     +int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
     +		       struct bitmapped_pack *bp, uint32_t pack_int_id)
     +{
    -+	if (!m->chunk_disjoint_packs)
    -+		return error(_("MIDX does not contain the DISP chunk"));
    ++	if (!m->chunk_bitmapped_packs)
    ++		return error(_("MIDX does not contain the BTMP chunk"));
     +
     +	if (prepare_midx_pack(r, m, pack_int_id))
    -+		return error(_("could not load disjoint pack %"PRIu32), pack_int_id);
    ++		return error(_("could not load bitmapped pack %"PRIu32), pack_int_id);
     +
     +	bp->p = m->packs[pack_int_id];
    -+	bp->bitmap_pos = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id);
    -+	bp->bitmap_nr = get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 1);
    -+	bp->disjoint = !!get_be32(m->chunk_disjoint_packs + 3 * pack_int_id + 2);
    ++	bp->bitmap_pos = get_be32((char *)m->chunk_bitmapped_packs +
    ++				  MIDX_CHUNK_BITMAPPED_PACKS_WIDTH * pack_int_id);
    ++	bp->bitmap_nr = get_be32((char *)m->chunk_bitmapped_packs +
    ++				 MIDX_CHUNK_BITMAPPED_PACKS_WIDTH * pack_int_id +
    ++				 sizeof(uint32_t));
    ++	bp->pack_int_id = pack_int_id;
     +
     +	return 0;
     +}
    @@ midx.c: static size_t write_midx_header(struct hashfile *f,
      	uint32_t orig_pack_int_id;
      	char *pack_name;
      	struct packed_git *p;
    --	unsigned expired : 1;
     +
     +	uint32_t bitmap_pos;
     +	uint32_t bitmap_nr;
     +
    -+	unsigned expired : 1,
    -+		 disjoint : 1;
    + 	unsigned expired : 1;
      };
      
    - static void fill_pack_info(struct pack_info *info,
     @@ midx.c: static void fill_pack_info(struct pack_info *info,
      	info->orig_pack_int_id = orig_pack_int_id;
    - 	info->pack_name = pack_name;
    + 	info->pack_name = xstrdup(pack_name);
      	info->p = p;
     +	info->bitmap_pos = BITMAP_POS_UNKNOWN;
      }
      
      static int pack_info_compare(const void *_a, const void *_b)
    -@@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
    - {
    - 	struct write_midx_context *ctx = data;
    - 	struct packed_git *p;
    -+	struct string_list_item *item = NULL;
    - 
    - 	if (ends_with(file_name, ".idx")) {
    - 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
    -@@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
    - 		 * should be performed independently (likely checking
    - 		 * to_include before the existing MIDX).
    - 		 */
    --		if (ctx->m && midx_contains_pack(ctx->m, file_name))
    --			return;
    --		else if (ctx->to_include &&
    --			 !string_list_has_string(ctx->to_include, file_name))
    -+		if (ctx->m && midx_contains_pack(ctx->m, file_name)) {
    - 			return;
    -+		} else if (ctx->to_include) {
    -+			item = string_list_lookup(ctx->to_include, file_name);
    -+			if (!item)
    -+				return;
    -+		}
    - 
    - 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
    - 
    -@@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
    - 
    - 		fill_pack_info(&ctx->info[ctx->nr], p, xstrdup(file_name),
    - 			       ctx->nr);
    -+		if (item)
    -+			ctx->info[ctx->nr].disjoint = !!item->util;
    - 		ctx->nr++;
    - 	}
    - }
    -@@ midx.c: struct pack_midx_entry {
    - 	uint32_t pack_int_id;
    - 	time_t pack_mtime;
    - 	uint64_t offset;
    --	unsigned preferred : 1;
    -+	unsigned preferred : 1,
    -+		 disjoint : 1;
    - };
    - 
    - static int midx_oid_compare(const void *_a, const void *_b)
    -@@ midx.c: static int midx_oid_compare(const void *_a, const void *_b)
    - 	if (a->preferred < b->preferred)
    - 		return 1;
    - 
    -+	/* Sort objects in a disjoint pack last when multiple copies exist. */
    -+	if (a->disjoint < b->disjoint)
    -+		return -1;
    -+	if (a->disjoint > b->disjoint)
    -+		return 1;
    -+
    - 	if (a->pack_mtime > b->pack_mtime)
    - 		return -1;
    - 	else if (a->pack_mtime < b->pack_mtime)
    -@@ midx.c: static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
    - 					   &fanout->entries[fanout->nr],
    - 					   cur_object);
    - 		fanout->entries[fanout->nr].preferred = 0;
    -+		fanout->entries[fanout->nr].disjoint = 0;
    - 		fanout->nr++;
    - 	}
    - }
    -@@ midx.c: static void midx_fanout_add_pack_fanout(struct midx_fanout *fanout,
    - 				cur_object,
    - 				&fanout->entries[fanout->nr],
    - 				preferred);
    -+		fanout->entries[fanout->nr].disjoint = info->disjoint;
    - 		fanout->nr++;
    - 	}
    - }
    -@@ midx.c: static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
    - 		 * Take only the first duplicate.
    - 		 */
    - 		for (cur_object = 0; cur_object < fanout.nr; cur_object++) {
    --			if (cur_object && oideq(&fanout.entries[cur_object - 1].oid,
    --						&fanout.entries[cur_object].oid))
    --				continue;
    -+			struct pack_midx_entry *ours = &fanout.entries[cur_object];
    -+			if (cur_object) {
    -+				struct pack_midx_entry *prev = &fanout.entries[cur_object - 1];
    -+				if (oideq(&prev->oid, &ours->oid)) {
    -+					if (prev->disjoint && ours->disjoint)
    -+						die(_("duplicate object '%s' among disjoint packs '%s', '%s'"),
    -+						    oid_to_hex(&prev->oid),
    -+						    info[prev->pack_int_id].pack_name,
    -+						    info[ours->pack_int_id].pack_name);
    -+					continue;
    -+				}
    -+			}
    - 
    - 			ALLOC_GROW(deduplicated_entries, st_add(*nr_objects, 1),
    - 				   alloc_objects);
    --			memcpy(&deduplicated_entries[*nr_objects],
    --			       &fanout.entries[cur_object],
    -+			memcpy(&deduplicated_entries[*nr_objects], ours,
    - 			       sizeof(struct pack_midx_entry));
    - 			(*nr_objects)++;
    - 		}
     @@ midx.c: static int write_midx_pack_names(struct hashfile *f, void *data)
      	return 0;
      }
      
    -+static int write_midx_disjoint_packs(struct hashfile *f, void *data)
    ++static int write_midx_bitmapped_packs(struct hashfile *f, void *data)
     +{
     +	struct write_midx_context *ctx = data;
     +	size_t i;
    @@ midx.c: static int write_midx_pack_names(struct hashfile *f, void *data)
     +
     +		hashwrite_be32(f, pack->bitmap_pos);
     +		hashwrite_be32(f, pack->bitmap_nr);
    -+		hashwrite_be32(f, !!pack->disjoint);
     +	}
     +	return 0;
     +}
    @@ midx.c: static int write_midx_internal(const char *object_dir,
      	struct hashfile *f = NULL;
      	struct lock_file lk;
      	struct write_midx_context ctx = { 0 };
    -+	int pack_disjoint_concat_len = 0;
    ++	int bitmapped_packs_concat_len = 0;
      	int pack_name_concat_len = 0;
      	int dropped_packs = 0;
      	int result = 0;
    @@ midx.c: static int write_midx_internal(const char *object_dir,
     +		if (ctx.info[i].expired)
     +			continue;
     +		pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
    -+		pack_disjoint_concat_len += 3 * sizeof(uint32_t);
    ++		bitmapped_packs_concat_len += 2 * sizeof(uint32_t);
      	}
      
      	/* Check that the preferred pack wasn't expired (if given). */
    @@ midx.c: static int write_midx_internal(const char *object_dir,
      		add_chunk(cf, MIDX_CHUNKID_REVINDEX,
      			  st_mult(ctx.entries_nr, sizeof(uint32_t)),
      			  write_midx_revindex);
    -+		add_chunk(cf, MIDX_CHUNKID_DISJOINTPACKS,
    -+			  pack_disjoint_concat_len, write_midx_disjoint_packs);
    ++		add_chunk(cf, MIDX_CHUNKID_BITMAPPEDPACKS,
    ++			  bitmapped_packs_concat_len,
    ++			  write_midx_bitmapped_packs);
      	}
      
      	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
    @@ midx.h: struct multi_pack_index {
      
      	const unsigned char *chunk_pack_names;
      	size_t chunk_pack_names_len;
    -+	const uint32_t *chunk_disjoint_packs;
    -+	size_t chunk_disjoint_packs_len;
    ++	const uint32_t *chunk_bitmapped_packs;
    ++	size_t chunk_bitmapped_packs_len;
      	const uint32_t *chunk_oid_fanout;
      	const unsigned char *chunk_oid_lookup;
      	const unsigned char *chunk_object_offsets;
    @@ pack-bitmap.h: typedef int (*show_reachable_fn)(
     +	uint32_t bitmap_pos;
     +	uint32_t bitmap_nr;
     +
    -+	unsigned disjoint : 1;
    ++	uint32_t pack_int_id; /* MIDX only */
     +};
     +
      struct bitmap_index *prepare_bitmap_git(struct repository *r);
    @@ t/helper/test-read-midx.c: static int read_midx_preferred_pack(const char *objec
     +		printf("%s\n", pack_basename(pack.p));
     +		printf("  bitmap_pos: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_pos);
     +		printf("  bitmap_nr: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_nr);
    -+		printf("  disjoint: %s\n", pack.disjoint & 0x1 ? "yes" : "no");
     +	}
     +
     +	close_midx(midx);
    @@ t/helper/test-read-midx.c: int cmd__read_midx(int argc, const char **argv)
      }
     
      ## t/t5319-multi-pack-index.sh ##
    -@@ t/t5319-multi-pack-index.sh: test_expect_success 'reader notices too-small revindex chunk' '
    - 	test_cmp expect.err err
    +@@ t/t5319-multi-pack-index.sh: test_expect_success 'reader notices out-of-bounds fanout' '
    + 	test_cmp expect err
      '
      
    -+test_expect_success 'disjoint packs are stored via the DISP chunk' '
    ++test_expect_success 'bitmapped packs are stored via the BTMP chunk' '
     +	test_when_finished "rm -fr repo" &&
     +	git init repo &&
     +	(
    @@ t/t5319-multi-pack-index.sh: test_expect_success 'reader notices too-small revin
     +			git repack -d || return 1
     +		done &&
     +
    -+		find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename | sort >packs &&
    ++		find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename |
    ++		sort >packs &&
     +
     +		git multi-pack-index write --stdin-packs <packs &&
     +		test_must_fail test-tool read-midx --bitmap $objdir 2>err &&
     +		cat >expect <<-\EOF &&
    -+		error: MIDX does not contain the DISP chunk
    ++		error: MIDX does not contain the BTMP chunk
     +		EOF
     +		test_cmp expect err &&
     +
    -+		sed -e "s/^/+/g" packs >in &&
     +		git multi-pack-index write --stdin-packs --bitmap \
    -+			--preferred-pack="$(head -n1 <packs)" <in &&
    ++			--preferred-pack="$(head -n1 <packs)" <packs &&
     +		test-tool read-midx --bitmap $objdir >actual &&
     +		for i in $(test_seq $(wc -l <packs))
     +		do
     +			sed -ne "${i}s/\.idx$/\.pack/p" packs &&
    -+			echo "  bitmap_pos: $(( $(( $i - 1 )) * 3 ))" &&
    -+			echo "  bitmap_nr: 3" &&
    -+			echo "  disjoint: yes" || return 1
    ++			echo "  bitmap_pos: $((($i - 1) * 3))" &&
    ++			echo "  bitmap_nr: 3" || return 1
     +		done >expect &&
     +		test_cmp expect actual
     +	)
     +'
    -+
    -+test_expect_success 'non-disjoint packs are detected' '
    -+	test_when_finished "rm -fr repo" &&
    -+	git init repo &&
    -+	(
    -+		cd repo &&
    -+
    -+		test_commit base &&
    -+		git repack -d &&
    -+		test_commit other &&
    -+		git repack -a &&
    -+
    -+		ls -la .git/objects/pack/ &&
    -+
    -+		find $objdir/pack -type f -name "*.idx" |
    -+			sed -e "s/.*\/\(.*\)$/+\1/g" >in &&
    -+
    -+		test_must_fail git multi-pack-index write --stdin-packs \
    -+			--bitmap <in 2>err &&
    -+		grep "duplicate object.* among disjoint packs" err
    -+	)
    -+'
     +
      test_done
 6:  541fbb442b =  6:  96f397a2b2 midx: implement `midx_locate_pack()`
 7:  3019738b52 <  -:  ---------- midx: implement `--retain-disjoint` mode
 8:  0368f7ab37 <  -:  ---------- pack-objects: implement `--ignore-disjoint` mode
 9:  b75869befb <  -:  ---------- repack: implement `--extend-disjoint` mode
10:  970bd9eaea !  7:  df9233cf0f pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
    @@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitma
      	unuse_pack(&w_curs);
     +}
     +
    ++static int bitmapped_pack_cmp(const void *va, const void *vb)
    ++{
    ++	const struct bitmapped_pack *a = va;
    ++	const struct bitmapped_pack *b = vb;
    ++
    ++	if (a->bitmap_pos < b->bitmap_pos)
    ++		return -1;
    ++	if (a->bitmap_pos > b->bitmap_pos)
    ++		return 1;
    ++	return 0;
    ++}
    ++
     +int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
     +				       struct packed_git **packfile_out,
     +				       uint32_t *entries,
    @@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitma
     +
     +			objects_nr += pack.p->num_objects;
     +		}
    ++
    ++		QSORT(packs, packs_nr, bitmapped_pack_cmp);
     +	} else {
     +		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
     +
     +		packs[packs_nr].p = bitmap_git->pack;
     +		packs[packs_nr].bitmap_pos = 0;
     +		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
    -+		packs[packs_nr].disjoint = 1;
     +
     +		objects_nr = packs[packs_nr++].p->num_objects;
     +	}
 -:  ---------- >  8:  595b3b6986 ewah: implement `bitmap_is_empty()`
11:  432854b27c !  9:  d851f821fc pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
    @@ builtin/pack-objects.c: static int get_object_list_from_bitmap(struct rev_info *
      		display_progress(progress_state, nr_seen);
     
      ## pack-bitmap.c ##
    -@@ pack-bitmap.c: static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
    - 	unuse_pack(&w_curs);
    +@@ pack-bitmap.c: static int bitmapped_pack_cmp(const void *va, const void *vb)
    + 	return 0;
      }
      
     -int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
    @@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitma
      
     -	*entries = bitmap_popcount(reuse);
     -	if (!*entries) {
    -+	if (!bitmap_popcount(reuse)) {
    ++	if (bitmap_is_empty(reuse)) {
     +		free(packs);
      		bitmap_free(reuse);
     -		return -1;
12:  36b794d9e1 ! 10:  f551892bab pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
    @@ builtin/pack-objects.c: static int pack_options_allow_reuse(void)
      			BUG("expected non-empty reuse bitmap");
     
      ## pack-bitmap.c ##
    -@@ pack-bitmap.c: static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
    +@@ pack-bitmap.c: static int bitmapped_pack_cmp(const void *va, const void *vb)
      }
      
      void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
13:  dca1083d8e = 11:  67e4fd8a06 pack-objects: parameterize pack-reuse routines over a single pack
14:  6f4fba861b ! 12:  9a5c38514b pack-objects: keep track of `pack_start` for each reuse pack
    @@ Commit message
         In order to compute this value correctly, we need to know not only where
         we are in the packfile we're assembling (with `hashfile_total(f)`) but
         also the position of the first byte of the packfile that we are
    -    currently reusing.
    +    currently reusing. Currently, this works just fine, since when reusing
    +    only a single pack those two values are always identical (because
    +    verbatim reuse is the first thing pack-objects does when enabled after
    +    writing the pack header).
    +
    +    But when reusing multiple packs which have one or more gaps, we'll need
    +    to account for these two values diverging.
     
         Together, these two allow us to compute the reused chunk's offset
         difference relative to the start of the reused pack, as desired.
15:  2bb01e2370 = 13:  5492d11f25 pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
16:  67a8196978 ! 14:  b32742ebcb pack-objects: prepare `write_reused_pack()` for multi-pack reuse
    @@ Commit message
     
         Prepare this function for multi-pack reuse by removing the assumption
         that the bit position corresponding to the first object being reused
    -    from a given pack may not be at bit position zero.
    +    from a given pack must be at bit position zero.
     
         The changes in this function are mostly straightforward. Initialize `i`
         to the position of the first word to contain bits corresponding to that
17:  1f766648df = 15:  805c42185a pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse
18:  4cd9f99bfd ! 16:  55696bc1c9 pack-objects: include number of packs reused in output
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
     +			   (uintmax_t)reuse_packfiles_used_nr);
      
      cleanup:
    - 	free_packing_data(&to_pack);
    + 	clear_packing_data(&to_pack);
19:  176c4c95ac <  -:  ---------- pack-bitmap: prepare to mark objects from multiple packs for reuse
 -:  ---------- > 17:  6ede9e0603 git-compat-util.h: implement checked size_t to uint32_t conversion
 -:  ---------- > 18:  ab0456a71e midx: implement `midx_preferred_pack()`
 -:  ---------- > 19:  14b054d272 pack-revindex: factor out `midx_key_to_pack_pos()` helper
 -:  ---------- > 20:  e99481014e pack-revindex: implement `midx_pair_to_pack_pos()`
 -:  ---------- > 21:  3e3625aebe pack-bitmap: prepare to mark objects from multiple packs for reuse
20:  46f1a03b8b ! 22:  1723cd0384 pack-objects: add tracing for various packfile metrics
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
     +	trace2_data_intmax("pack-objects", the_repository, "packs-reused", reuse_packfiles_used_nr);
     +
      cleanup:
    - 	free_packing_data(&to_pack);
    + 	clear_packing_data(&to_pack);
      	list_objects_filter_release(&filter_options);
21:  fb764fbacc = 23:  79c830e37a t/test-lib-functions.sh: implement `test_trace2_data` helper
22:  3140a1703a ! 24:  d06b0961b5 pack-objects: allow setting `pack.allowPackReuse` to "single"
    @@ Commit message
         "single" implies the same behavior as "true", "1", "yes", and so on. But
         it will complement a new "multi" value (to be introduced in a future
         commit). When set to "single", we will only perform pack reuse on a
    -    single pack, regardless of whether or not there are multiple disjoint
    +    single pack, regardless of whether or not there are multiple MIDX'd
         packs.
     
         This requires no code changes (yet), since we only support single pack
23:  7345e39467 ! 25:  7002cf08fe pack-bitmap: reuse objects from all disjoint packs
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    pack-bitmap: reuse objects from all disjoint packs
    +    pack-bitmap: enable reuse from all bitmapped packs
     
         Now that both the pack-bitmap and pack-objects code are prepared to
    -    handle marking and using objects from multiple disjoint packs for
    -    verbatim reuse, allow marking objects from all disjoint packs as
    +    handle marking and using objects from multiple bitmapped packs for
    +    verbatim reuse, allow marking objects from all bitmapped packs as
         eligible for reuse.
     
         Within the `reuse_partial_packfile_from_bitmap()` function, we no longer
         only mark the pack whose first object is at bit position zero for reuse,
    -    and instead mark any pack which is flagged as disjoint by the MIDX as a
    -    reuse candidate. If no such packs exist (i.e because we are reading a
    -    MIDX written before the "DISP" chunk was introduced), then treat the
    -    preferred pack as disjoint for the purposes of reuse. This is a safe
    -    assumption to make since all duplicate objects are resolved in favor of
    -    the preferred pack.
    +    and instead mark any pack contained in the MIDX as a reuse candidate.
     
         Provide a handful of test cases in a new script (t5332) exercising
         interesting behavior for multi-pack reuse to ensure that we performed
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## Documentation/config/pack.txt ##
    -@@ Documentation/config/pack.txt: to linkgit:git-repack[1].
    +@@ Documentation/config/pack.txt: all existing objects. You can force recompression by passing the -F option
    + to linkgit:git-repack[1].
    + 
      pack.allowPackReuse::
    - 	When true or "single", and when reachability bitmaps are enabled,
    - 	pack-objects will try to send parts of the bitmapped packfile
    +-	When true or "single", and when reachability bitmaps are enabled,
    +-	pack-objects will try to send parts of the bitmapped packfile
     -	verbatim. This can reduce memory and CPU usage to serve fetches,
    -+	verbatim. When "multi", and when a multi-pack reachability bitmap is
    -+	available, pack-objects will try to send parts of all packs marked as
    -+	disjoint by the MIDX. If only a single pack bitmap is available, and
    +-	but might result in sending a slightly larger pack. Defaults to
    +-	true.
    ++	When true or "single", and when reachability bitmaps are
    ++	enabled, pack-objects will try to send parts of the bitmapped
    ++	packfile verbatim. When "multi", and when a multi-pack
    ++	reachability bitmap is available, pack-objects will try to send
    ++	parts of all packs in the MIDX.
    +++
    ++	If only a single pack bitmap is available, and
     +	`pack.allowPackReuse` is set to "multi", reuse parts of just the
    -+	bitmapped packfile. This can reduce memory and CPU usage to serve fetches,
    - 	but might result in sending a slightly larger pack. Defaults to
    - 	true.
    ++	bitmapped packfile. This can reduce memory and CPU usage to
    ++	serve fetches, but might result in sending a slightly larger
    ++	pack. Defaults to true.
      
    + pack.island::
    + 	An extended regular expression configuring a set of delta
     
      ## builtin/pack-objects.c ##
     @@ builtin/pack-objects.c: static int use_bitmap_index = -1;
    @@ builtin/pack-objects.c: static int get_object_list_from_bitmap(struct rev_info *
      		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
     
      ## pack-bitmap.c ##
    -@@ pack-bitmap.c: static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
    - 	unuse_pack(&w_curs);
    - }
    - 
    -+static void make_disjoint_pack(struct bitmapped_pack *out, struct packed_git *p)
    -+{
    -+	out->p = p;
    -+	out->bitmap_pos = 0;
    -+	out->bitmap_nr = p->num_objects;
    -+	out->disjoint = 1;
    -+}
    -+
    +@@ pack-bitmap.c: static int bitmapped_pack_cmp(const void *va, const void *vb)
      void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
      					struct bitmapped_pack **packs_out,
      					size_t *packs_nr_out,
    @@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitm
      				free(packs);
      				return;
      			}
    --			if (!pack.bitmap_nr)
    ++
    + 			if (!pack.bitmap_nr)
     -				continue; /* no objects from this pack */
     -			if (pack.bitmap_pos)
     -				continue; /* not preferred pack */
    -+
    -+			if (!pack.disjoint)
     +				continue;
     +
     +			if (!multi_pack_reuse && pack.bitmap_pos) {
    @@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitm
     +
     +			if (!multi_pack_reuse)
     +				break;
    -+		}
    -+
    -+		if (!packs_nr) {
    -+			/*
    -+			 * Old MIDXs (i.e. those written before the "DISP" chunk
    -+			 * existed) will not have any packs marked as disjoint.
    -+			 *
    -+			 * But we still want to perform pack reuse with the
    -+			 * special "preferred pack" as before. To do this, form
    -+			 * the singleton set containing just the preferred pack,
    -+			 * which is trivially disjoint with itself.
    -+			 *
    -+			 * Moreover, the MIDX is guaranteed to resolve duplicate
    -+			 * objects in favor of the copy in the preferred pack
    -+			 * (if one exists). Thus, we can safely perform pack
    -+			 * reuse on this pack.
    -+			 */
    -+			uint32_t preferred_pack_pos;
    -+			struct packed_git *preferred_pack;
    -+
    -+			preferred_pack_pos = midx_preferred_pack(bitmap_git);
    -+			preferred_pack = bitmap_git->midx->packs[preferred_pack_pos];
    -+
    -+			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
    -+
    -+			make_disjoint_pack(&packs[packs_nr], preferred_pack);
    -+			objects_nr = packs[packs_nr++].p->num_objects;
      		}
    - 	} else {
    + 
    + 		QSORT(packs, packs_nr, bitmapped_pack_cmp);
    +@@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
      		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
      
    --		packs[packs_nr].p = bitmap_git->pack;
    + 		packs[packs_nr].p = bitmap_git->pack;
     -		packs[packs_nr].bitmap_pos = 0;
    --		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
    --		packs[packs_nr].disjoint = 1;
    --
    -+		make_disjoint_pack(&packs[packs_nr], bitmap_git->pack);
    - 		objects_nr = packs[packs_nr++].p->num_objects;
    + 		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
    ++		packs[packs_nr].bitmap_pos = 0;
    + 
    +-		objects_nr = packs[packs_nr++].p->num_objects;
    ++		objects_nr = packs[packs_nr++].bitmap_nr;
      	}
      
    + 	word_alloc = objects_nr / BITS_IN_EWORD;
     @@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
      		word_alloc++;
      	reuse = bitmap_word_alloc(word_alloc);
    @@ pack-bitmap.c: void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitm
     +	for (i = 0; i < packs_nr; i++)
     +		reuse_partial_packfile_from_bitmap_1(bitmap_git, &packs[i], reuse);
      
    - 	if (!bitmap_popcount(reuse)) {
    + 	if (bitmap_is_empty(reuse)) {
      		free(packs);
     
      ## pack-bitmap.h ##
    -@@ pack-bitmap.h: uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
    +@@ pack-bitmap.h: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
      void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
      					struct bitmapped_pack **packs_out,
      					size_t *packs_nr_out,
    @@ t/t5332-multi-pack-reuse.sh (new)
     +
     +. ./test-lib.sh
     +. "$TEST_DIRECTORY"/lib-bitmap.sh
    -+. "$TEST_DIRECTORY"/lib-disjoint.sh
     +
     +objdir=.git/objects
     +packdir=$objdir/pack
     +
    -+all_packs () {
    -+	find $packdir -type f -name "*.idx" | sed -e 's/^.*\/\([^\/]\)/\1/g'
    -+}
    -+
    -+all_disjoint () {
    -+	all_packs | sed -e 's/^/+/g'
    -+}
    -+
     +test_pack_reused () {
     +	test_trace2_data pack-objects pack-reused "$1"
     +}
    @@ t/t5332-multi-pack-reuse.sh (new)
     +	grep "$1" objects | cut -d" " -f1
     +}
     +
    -+test_expect_success 'setup' '
    ++test_expect_success 'preferred pack is reused for single-pack reuse' '
    ++	test_config pack.allowPackReuse single &&
    ++
    ++	for i in A B
    ++	do
    ++		test_commit "$i" &&
    ++		git repack -d || return 1
    ++	done &&
    ++
    ++	git multi-pack-index write --bitmap &&
    ++
    ++	: >trace2.txt &&
    ++	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
    ++		git pack-objects --stdout --revs --all >/dev/null &&
    ++
    ++	test_pack_reused 3 <trace2.txt &&
    ++	test_packs_reused 1 <trace2.txt
    ++'
    ++
    ++test_expect_success 'enable multi-pack reuse' '
     +	git config pack.allowPackReuse multi
     +'
     +
    -+test_expect_success 'preferred pack is reused without packs marked disjoint' '
    -+	test_commit A &&
    -+	test_commit B &&
    -+
    -+	A="$(echo A | git pack-objects --unpacked --delta-base-offset $packdir/pack)" &&
    -+	B="$(echo B | git pack-objects --unpacked --delta-base-offset $packdir/pack)" &&
    -+
    -+	git prune-packed &&
    -+
    -+	git multi-pack-index write --bitmap &&
    -+
    -+	test_must_not_be_disjoint "pack-$A.pack" &&
    -+	test_must_not_be_disjoint "pack-$B.pack" &&
    -+
    -+	: >trace2.txt &&
    -+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
    -+		git pack-objects --stdout --revs --all >/dev/null &&
    -+
    -+	test_pack_reused 3 <trace2.txt &&
    -+	test_packs_reused 1 <trace2.txt
    -+'
    -+
    -+test_expect_success 'reuse all objects from subset of disjoint packs' '
    ++test_expect_success 'reuse all objects from subset of bitmapped packs' '
     +	test_commit C &&
    ++	git repack -d &&
     +
    -+	C="$(echo C | git pack-objects --unpacked --delta-base-offset $packdir/pack)" &&
    -+
    -+	git prune-packed &&
    ++	git multi-pack-index write --bitmap &&
     +
     +	cat >in <<-EOF &&
    -+	pack-$A.idx
    -+	+pack-$B.idx
    -+	+pack-$C.idx
    ++	$(git rev-parse C)
    ++	^$(git rev-parse A)
     +	EOF
    -+	git multi-pack-index write --bitmap --stdin-packs <in &&
    -+
    -+	test_must_not_be_disjoint "pack-$A.pack" &&
    -+	test_must_be_disjoint "pack-$B.pack" &&
    -+	test_must_be_disjoint "pack-$C.pack" &&
     +
     +	: >trace2.txt &&
     +	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
    -+		git pack-objects --stdout --revs --all >/dev/null &&
    ++		git pack-objects --stdout --revs <in >/dev/null &&
     +
     +	test_pack_reused 6 <trace2.txt &&
     +	test_packs_reused 2 <trace2.txt
     +'
     +
    -+test_expect_success 'reuse all objects from all disjoint packs' '
    -+	rm -fr $packdir/multi-pack-index* &&
    -+
    -+	all_disjoint >in &&
    -+	git multi-pack-index write --bitmap --stdin-packs <in &&
    -+
    -+	test_must_be_disjoint "pack-$A.pack" &&
    -+	test_must_be_disjoint "pack-$B.pack" &&
    -+	test_must_be_disjoint "pack-$C.pack" &&
    -+
    ++test_expect_success 'reuse all objects from all packs' '
     +	: >trace2.txt &&
     +	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
     +		git pack-objects --stdout --revs --all >/dev/null &&
    @@ t/t5332-multi-pack-reuse.sh (new)
     +	test_packs_reused 3 <trace2.txt
     +'
     +
    -+test_expect_success 'reuse objects from first disjoint pack with middle gap' '
    -+	test_commit D &&
    -+	test_commit E &&
    -+	test_commit F &&
    ++test_expect_success 'reuse objects from first pack with middle gap' '
    ++	for i in D E F
    ++	do
    ++		test_commit "$i" || return 1
    ++	done &&
     +
     +	# Set "pack.window" to zero to ensure that we do not create any
     +	# deltas, which could alter the amount of pack reuse we perform
    @@ t/t5332-multi-pack-reuse.sh (new)
     +	# Ensure that the pack we are constructing sorts ahead of any
     +	# other packs in lexical/bitmap order by choosing it as the
     +	# preferred pack.
    -+	all_disjoint >in &&
    -+	git multi-pack-index write --bitmap --preferred-pack="pack-$D.idx" \
    -+		--stdin-packs <in &&
    -+
    -+	test_must_be_disjoint pack-$A.pack &&
    -+	test_must_be_disjoint pack-$B.pack &&
    -+	test_must_be_disjoint pack-$C.pack &&
    -+	test_must_be_disjoint pack-$D.pack &&
    ++	git multi-pack-index write --bitmap --preferred-pack="pack-$D.idx" &&
     +
     +	cat >in <<-EOF &&
     +	$(git rev-parse E)
    @@ t/t5332-multi-pack-reuse.sh (new)
     +	test_packs_reused 1 <trace2.txt
     +'
     +
    -+test_expect_success 'reuse objects from middle disjoint pack with middle gap' '
    ++test_expect_success 'reuse objects from middle pack with middle gap' '
     +	rm -fr $packdir/multi-pack-index* &&
     +
     +	# Ensure that the pack we are constructing sort into any
     +	# position *but* the first one, by choosing a different pack as
     +	# the preferred one.
    -+	all_disjoint >in &&
    -+	git multi-pack-index write --bitmap --preferred-pack="pack-$A.idx" \
    -+		--stdin-packs <in &&
    ++	git multi-pack-index write --bitmap --preferred-pack="pack-$A.idx" &&
     +
     +	cat >in <<-EOF &&
     +	$(git rev-parse E)
    @@ t/t5332-multi-pack-reuse.sh (new)
     +	test_packs_reused 1 <trace2.txt
     +'
     +
    -+test_expect_success 'omit delta with uninteresting base' '
    ++test_expect_success 'omit delta with uninteresting base (same pack)' '
     +	git repack -adk &&
     +
     +	test_seq 32 >f &&
    @@ t/t5332-multi-pack-reuse.sh (new)
     +
     +	have_delta "$(git rev-parse $delta:f)" "$(git rev-parse $base:f)" &&
     +
    -+	all_disjoint >in &&
    -+	git multi-pack-index write --bitmap --stdin-packs <in &&
    ++	git multi-pack-index write --bitmap &&
     +
     +	cat >in <<-EOF &&
     +	$(git rev-parse other)
    @@ t/t5332-multi-pack-reuse.sh (new)
     +	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
     +		git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
     +
    -+	# Even though all packs are marked disjoint, we can only reuse
    -+	# the 3 objects corresponding to "other" from the latest pack.
    ++	# We can only reuse the 3 objects corresponding to "other" from
    ++	# the latest pack.
     +	#
     +	# This is because even though we want "delta", we do not want
     +	# "base", meaning that we have to inflate the delta/base-pair
    @@ t/t5332-multi-pack-reuse.sh (new)
     +	test_packs_reused 1 <trace2.txt
     +'
     +
    ++test_expect_success 'omit delta from uninteresting base (cross pack)' '
    ++	cat >in <<-EOF &&
    ++	$(git rev-parse $base)
    ++	^$(git rev-parse $delta)
    ++	EOF
    ++
    ++	P="$(git pack-objects --revs $packdir/pack <in)" &&
    ++
    ++	git multi-pack-index write --bitmap --preferred-pack="pack-$P.idx" &&
    ++
    ++	: >trace2.txt &&
    ++	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
    ++		git pack-objects --stdout --delta-base-offset --all >/dev/null &&
    ++
    ++	packs_nr="$(find $packdir -type f -name "pack-*.pack" | wc -l)" &&
    ++	objects_nr="$(git rev-list --count --all --objects)" &&
    ++
    ++	test_pack_reused $(($objects_nr - 1)) <trace2.txt &&
    ++	test_packs_reused $packs_nr <trace2.txt
    ++'
    ++
     +test_done
24:  980b318f98 = 26:  94e5ae4cf6 t/perf: add performance tests for multi-pack reuse

base-commit: 1a87c842ece327d03d08096395969aca5e0a6996
-- 
2.43.0.102.ga31d690331.dirty


* [PATCH v2 01/26] pack-objects: free packing_data in more places
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
@ 2023-12-14 22:23   ` Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 02/26] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
                     ` (25 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The pack-objects internals use a packing_data struct to track what
objects are part of the pack(s) being formed.

Since these structures contain allocated fields, failing to
appropriately free() them results in a leak. Plug that leak by
introducing a clear_packing_data() function, and call it in the
appropriate spots.

This is a fairly straightforward leak to plug, since none of the callers
expect to read any values or have any references to parts of the address
space being freed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
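As a rough illustration of the "clear function" pattern described above
(this is not the code from this patch), the sketch below shows a struct
that owns a couple of heap arrays and a matching clear routine that can
be called on every exit path. The names `packing_sketch`,
`packing_sketch_init()` and `packing_sketch_clear()` are hypothetical.
Unlike `clear_packing_data()`, the sketch also resets the pointers, so
that calling it twice is harmless; that is an assumption made for the
example, not something this patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a struct that owns several heap arrays. */
struct packing_sketch {
	uint32_t *in_pack_pos;
	unsigned char *layer;
	size_t nr;
};

/*
 * Release every owned allocation. free(NULL) is a no-op, so fields
 * that were never allocated need no special casing. Resetting the
 * fields afterwards makes a repeated call safe.
 */
static void packing_sketch_clear(struct packing_sketch *p)
{
	if (!p)
		return;

	free(p->in_pack_pos);
	free(p->layer);
	p->in_pack_pos = NULL;
	p->layer = NULL;
	p->nr = 0;
}

/* Allocate the owned arrays; returns 0 on success, -1 on failure. */
static int packing_sketch_init(struct packing_sketch *p, size_t nr)
{
	assert(p);

	p->in_pack_pos = calloc(nr, sizeof(*p->in_pack_pos));
	p->layer = calloc(nr, sizeof(*p->layer));
	p->nr = nr;

	/* calloc() may legitimately return NULL when nr is zero. */
	if (nr && (!p->in_pack_pos || !p->layer)) {
		packing_sketch_clear(p);
		return -1;
	}
	return 0;
}
```

A caller would pair each successful `packing_sketch_init()` with a
`packing_sketch_clear()` on both the error path and the normal path,
which mirrors how the midx.c hunk below frees `pdata` before the
`goto cleanup` as well as after the successful write.
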
 builtin/pack-objects.c |  1 +
 midx.c                 |  5 +++++
 pack-objects.c         | 15 +++++++++++++++
 pack-objects.h         |  1 +
 4 files changed, 22 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 89a8b5a976..321d7effb0 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4522,6 +4522,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			   reuse_packfile_objects);
 
 cleanup:
+	clear_packing_data(&to_pack);
 	list_objects_filter_release(&filter_options);
 	strvec_clear(&rp);
 
diff --git a/midx.c b/midx.c
index 1d14661dad..778dd536c8 100644
--- a/midx.c
+++ b/midx.c
@@ -1603,8 +1603,13 @@ static int write_midx_internal(const char *object_dir,
 				      flags) < 0) {
 			error(_("could not write multi-pack bitmap"));
 			result = 1;
+			clear_packing_data(&pdata);
+			free(commits);
 			goto cleanup;
 		}
+
+		clear_packing_data(&pdata);
+		free(commits);
 	}
 	/*
 	 * NOTE: Do not use ctx.entries beyond this point, since it might
diff --git a/pack-objects.c b/pack-objects.c
index f403ca6986..a9d9855063 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -151,6 +151,21 @@ void prepare_packing_data(struct repository *r, struct packing_data *pdata)
 	init_recursive_mutex(&pdata->odb_lock);
 }
 
+void clear_packing_data(struct packing_data *pdata)
+{
+	if (!pdata)
+		return;
+
+	free(pdata->cruft_mtime);
+	free(pdata->in_pack);
+	free(pdata->in_pack_by_idx);
+	free(pdata->in_pack_pos);
+	free(pdata->index);
+	free(pdata->layer);
+	free(pdata->objects);
+	free(pdata->tree_depth);
+}
+
 struct object_entry *packlist_alloc(struct packing_data *pdata,
 				    const struct object_id *oid)
 {
diff --git a/pack-objects.h b/pack-objects.h
index 0d78db40cb..b9898a4e64 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -169,6 +169,7 @@ struct packing_data {
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
+void clear_packing_data(struct packing_data *pdata);
 
 /* Protect access to object database */
 static inline void packing_data_lock(struct packing_data *pdata)
-- 
2.43.0.102.ga31d690331.dirty



* [PATCH v2 02/26] pack-bitmap-write: deep-clear the `bb_commit` slab
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 01/26] pack-objects: free packing_data in more places Taylor Blau
@ 2023-12-14 22:23   ` Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 03/26] pack-bitmap: plug leak in find_objects() Taylor Blau
                     ` (24 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The `bb_commit` commit slab is used by the pack-bitmap-write machinery
to track various pieces of bookkeeping used to generate reachability
bitmaps.

Even though we clear the slab when freeing the bitmap_builder struct
(with `bitmap_builder_clear()`), there are still pointers which point to
locations in memory that have not yet been freed, resulting in a leak.

Plug the leak by introducing a suitable `free_fn` for the `struct
bb_commit` type, and make sure it is called on each member of the slab
via the `deep_clear_bb_data()` function.

Note that either of the bitmaps passed to `bitmap_free()` may be NULL,
but `bitmap_free()` is a no-op for NULL arguments, so it is OK to call
it unconditionally.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
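To illustrate the difference between a shallow and a deep clear, here is
a minimal sketch over a plain array rather than a real commit slab. The
names `slot_sketch`, `clear_slot_sketch()` and `deep_clear_slots()` are
hypothetical and only loosely mirror `clear_bb_commit()` and
`deep_clear_bb_data()`; the commit-slab machinery itself is not
reproduced here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical element type: each slot owns one heap allocation. */
struct slot_sketch {
	int *payload;
};

typedef void (*slot_free_fn)(struct slot_sketch *);

/* Per-element cleanup, similar in spirit to clear_bb_commit(). */
static void clear_slot_sketch(struct slot_sketch *s)
{
	free(s->payload); /* free(NULL) is a no-op */
	s->payload = NULL;
}

/*
 * Deep clear: run free_fn on every element, then free the array.
 * A shallow clear would do only the final free(), leaking whatever
 * each element still points to.
 */
static void deep_clear_slots(struct slot_sketch **slots, size_t nr,
			     slot_free_fn free_fn)
{
	size_t i;

	assert(slots);

	if (free_fn && *slots) {
		for (i = 0; i < nr; i++)
			free_fn(&(*slots)[i]);
	}
	free(*slots);
	*slots = NULL;
}
```

Because the per-element callback tolerates NULL members, elements that
never had a payload allocated need no special handling.
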
 pack-bitmap-write.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index f4ecdf8b0e..ae37fb6976 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -198,6 +198,13 @@ struct bb_commit {
 	unsigned idx; /* within selected array */
 };
 
+static void clear_bb_commit(struct bb_commit *commit)
+{
+	free_commit_list(commit->reverse_edges);
+	bitmap_free(commit->commit_mask);
+	bitmap_free(commit->bitmap);
+}
+
 define_commit_slab(bb_data, struct bb_commit);
 
 struct bitmap_builder {
@@ -339,7 +346,7 @@ static void bitmap_builder_init(struct bitmap_builder *bb,
 
 static void bitmap_builder_clear(struct bitmap_builder *bb)
 {
-	clear_bb_data(&bb->data);
+	deep_clear_bb_data(&bb->data, clear_bb_commit);
 	free(bb->commits);
 	bb->commits_nr = bb->commits_alloc = 0;
 }
-- 
2.43.0.102.ga31d690331.dirty



* [PATCH v2 03/26] pack-bitmap: plug leak in find_objects()
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 01/26] pack-objects: free packing_data in more places Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 02/26] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
@ 2023-12-14 22:23   ` Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 04/26] midx: factor out `fill_pack_info()` Taylor Blau
                     ` (23 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The `find_objects()` function creates an object_list for any tips of the
reachability query which do not have corresponding bitmaps.

The object_list is not used outside of `find_objects()`, but we never
free it with `object_list_free()`, resulting in a leak. Let's plug that
leak by calling `object_list_free()`, which results in t6113 becoming
leak-free.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c                      | 2 ++
 t/t6113-rev-list-bitmap-filters.sh | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 0260890341..d2f1306960 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1280,6 +1280,8 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		base = fill_in_bitmap(bitmap_git, revs, base, seen);
 	}
 
+	object_list_free(&not_mapped);
+
 	return base;
 }
 
diff --git a/t/t6113-rev-list-bitmap-filters.sh b/t/t6113-rev-list-bitmap-filters.sh
index 86c70521f1..459f0d7412 100755
--- a/t/t6113-rev-list-bitmap-filters.sh
+++ b/t/t6113-rev-list-bitmap-filters.sh
@@ -4,6 +4,8 @@ test_description='rev-list combining bitmaps and filters'
 . ./test-lib.sh
 . "$TEST_DIRECTORY"/lib-bitmap.sh
 
+TEST_PASSES_SANITIZE_LEAK=true
+
 test_expect_success 'set up bitmapped repo' '
 	# one commit will have bitmaps, the other will not
 	test_commit one &&
-- 
2.43.0.102.ga31d690331.dirty



* [PATCH v2 04/26] midx: factor out `fill_pack_info()`
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (2 preceding siblings ...)
  2023-12-14 22:23   ` [PATCH v2 03/26] pack-bitmap: plug leak in find_objects() Taylor Blau
@ 2023-12-14 22:23   ` Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 05/26] midx: implement `BTMP` chunk Taylor Blau
                     ` (22 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

When selecting which packfiles will be written while generating a MIDX,
the MIDX internals fill out a 'struct pack_info' with various pieces of
book-keeping.

Instead of filling out each field of the `pack_info` structure
individually in each of the two spots that modify the array of such
structures (`ctx->info`), extract a common routine that does this for
us.

This reduces the code duplication by a modest amount. But more
importantly, it zero-initializes the structure before assigning values
into it. This hardens us against a future change which will add
additional fields to this structure, which (until this patch) was not
zero-initialized.

As a result, any new fields added to the `pack_info` structure need only
be updated in a single location, instead of at each spot within midx.c.

There are no functional changes in this patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
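The hardening argument above can be sketched as follows. This is a
simplified illustration, not the function from this patch:
`pack_info_sketch` and `fill_pack_info_sketch()` are hypothetical names,
the `bitmap_pos` field stands in for "some field added by a later
change", and the sketch borrows the name pointer instead of copying it
with `xstrdup()`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified version of a per-pack bookkeeping struct. */
struct pack_info_sketch {
	uint32_t orig_pack_int_id;
	const char *pack_name; /* borrowed here; the real code copies it */
	uint32_t bitmap_pos;   /* stands in for a field added later */
	unsigned expired : 1;
};

/*
 * Zero the whole struct first, then assign the fields we know about.
 * Any field added in the future starts out as zero without every
 * caller having to remember to initialize it.
 */
static void fill_pack_info_sketch(struct pack_info_sketch *info,
				  const char *pack_name,
				  uint32_t orig_pack_int_id)
{
	assert(info);

	memset(info, 0, sizeof(*info));

	info->orig_pack_int_id = orig_pack_int_id;
	info->pack_name = pack_name;
}
```

Filling a slot that previously held garbage (as an array grown with
`ALLOC_GROW()` may) therefore yields `bitmap_pos == 0` and
`expired == 0` even though neither field is assigned explicitly.
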
 midx.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/midx.c b/midx.c
index 778dd536c8..8dba67ddbe 100644
--- a/midx.c
+++ b/midx.c
@@ -475,6 +475,17 @@ struct pack_info {
 	unsigned expired : 1;
 };
 
+static void fill_pack_info(struct pack_info *info,
+			   struct packed_git *p, const char *pack_name,
+			   uint32_t orig_pack_int_id)
+{
+	memset(info, 0, sizeof(struct pack_info));
+
+	info->orig_pack_int_id = orig_pack_int_id;
+	info->pack_name = xstrdup(pack_name);
+	info->p = p;
+}
+
 static int pack_info_compare(const void *_a, const void *_b)
 {
 	struct pack_info *a = (struct pack_info *)_a;
@@ -515,6 +526,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			     const char *file_name, void *data)
 {
 	struct write_midx_context *ctx = data;
+	struct packed_git *p;
 
 	if (ends_with(file_name, ".idx")) {
 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
@@ -541,27 +553,22 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
-		ctx->info[ctx->nr].p = add_packed_git(full_path,
-						      full_path_len,
-						      0);
-
-		if (!ctx->info[ctx->nr].p) {
+		p = add_packed_git(full_path, full_path_len, 0);
+		if (!p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(ctx->info[ctx->nr].p)) {
+		if (open_pack_index(p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
-			close_pack(ctx->info[ctx->nr].p);
-			FREE_AND_NULL(ctx->info[ctx->nr].p);
+			close_pack(p);
+			free(p);
 			return;
 		}
 
-		ctx->info[ctx->nr].pack_name = xstrdup(file_name);
-		ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
-		ctx->info[ctx->nr].expired = 0;
+		fill_pack_info(&ctx->info[ctx->nr], p, file_name, ctx->nr);
 		ctx->nr++;
 	}
 }
@@ -1321,11 +1328,6 @@ static int write_midx_internal(const char *object_dir,
 		for (i = 0; i < ctx.m->num_packs; i++) {
 			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
 
-			ctx.info[ctx.nr].orig_pack_int_id = i;
-			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
-			ctx.info[ctx.nr].p = ctx.m->packs[i];
-			ctx.info[ctx.nr].expired = 0;
-
 			if (flags & MIDX_WRITE_REV_INDEX) {
 				/*
 				 * If generating a reverse index, need to have
@@ -1341,10 +1343,10 @@ static int write_midx_internal(const char *object_dir,
 				if (open_pack_index(ctx.m->packs[i]))
 					die(_("could not open index for %s"),
 					    ctx.m->packs[i]->pack_name);
-				ctx.info[ctx.nr].p = ctx.m->packs[i];
 			}
 
-			ctx.nr++;
+			fill_pack_info(&ctx.info[ctx.nr++], ctx.m->packs[i],
+				       ctx.m->pack_names[i], i);
 		}
 	}
 
-- 
2.43.0.102.ga31d690331.dirty



* [PATCH v2 05/26] midx: implement `BTMP` chunk
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (3 preceding siblings ...)
  2023-12-14 22:23   ` [PATCH v2 04/26] midx: factor out `fill_pack_info()` Taylor Blau
@ 2023-12-14 22:23   ` Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 06/26] midx: implement `midx_locate_pack()` Taylor Blau
                     ` (21 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

When a multi-pack bitmap is used to implement verbatim pack reuse (that
is, when verbatim chunks from an on-disk packfile are copied
directly[^1]), it does so by using its "preferred pack" as the source
for pack-reuse.

This allows repositories to pack the majority of their objects into a
single (often large) pack, and then use it as the single source for
verbatim pack reuse. This increases the number of objects that are
reused verbatim (and consequently, decreases the amount of time it takes
to generate many packs). But this performance comes at a cost, which is
that the preferred packfile must pace its growth with that of the entire
repository in order to maintain the utility of verbatim pack reuse.

As repositories grow beyond what we can reasonably store in a single
packfile, the utility of verbatim pack reuse diminishes. Or, at the very
least, it becomes increasingly expensive to maintain as the pack
grows larger and larger.

It would be beneficial to be able to perform this same optimization over
multiple packs, provided some modest constraints (most importantly, that
the set of packs eligible for verbatim reuse are disjoint with respect
to the subset of their objects being sent).

If we assume that the packs which we treat as candidates for verbatim
reuse are disjoint with respect to any of their objects we may output,
we need to make only modest modifications to the verbatim pack-reuse
code itself. Most notably, we need to remove the assumption that the
bits in the reachability bitmap corresponding to objects from the single
reuse pack begin at the first bit position.

Future patches will unwind these assumptions and reimplement their
existing functionality as special cases of the more general assumptions
(e.g. that reuse bits can start anywhere within the bitset, but happen
to start at 0 for all existing cases).

This patch does not yet relax any of those assumptions. Instead, it
implements a foundational data-structure, the "Bitmapped Packs" (`BTMP`)
chunk of the multi-pack index. The `BTMP` chunk's contents are described
in detail in the documentation added by this patch (see
Documentation/gitformat-pack.txt). Importantly, the `BTMP` chunk
contains information to map regions of a multi-pack index's
reachability bitmap to the packs whose objects they represent.

For now, this chunk is only written, not read (outside of the test-tool
used in this patch to test the new chunk's behavior). Future patches
will begin to make use of this new chunk.

[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
  pack that wasn't used verbatim.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
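As a rough sketch of the mapping described above (not the
implementation in this patch), the code below assigns each pack a
contiguous region of the bitmap in pseudo-pack order, serializes an
entry as two 4-byte integers in network order, and looks up which pack
a given bit position belongs to. `struct btmp_entry`, `btmp_layout()`,
`btmp_encode()` and `btmp_find_pack()` are hypothetical names, and the
sketch assumes that every object in each pack is selected from that
pack (that is, no object is duplicated across packs).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one BTMP table entry. */
struct btmp_entry {
	uint32_t bitmap_pos; /* first bit occupied by this pack */
	uint32_t bitmap_nr;  /* number of bits belonging to this pack */
};

/*
 * Assign consecutive bitmap regions to packs listed in pseudo-pack
 * order. Returns the total number of bits used.
 */
static uint32_t btmp_layout(const uint32_t *num_objects, size_t nr,
			    struct btmp_entry *out)
{
	uint32_t pos = 0;
	size_t i;

	assert(num_objects && out);

	for (i = 0; i < nr; i++) {
		out[i].bitmap_pos = pos;
		out[i].bitmap_nr = num_objects[i];
		pos += num_objects[i];
	}
	return pos;
}

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_be32_sketch(unsigned char *buf, uint32_t v)
{
	buf[0] = (v >> 24) & 0xff;
	buf[1] = (v >> 16) & 0xff;
	buf[2] = (v >> 8) & 0xff;
	buf[3] = v & 0xff;
}

/* Serialize one entry as 8 bytes: bitmap_pos, then bitmap_nr. */
static void btmp_encode(const struct btmp_entry *e, unsigned char *buf)
{
	put_be32_sketch(buf, e->bitmap_pos);
	put_be32_sketch(buf + 4, e->bitmap_nr);
}

/* Return the index of the pack whose region contains `pos`, or -1. */
static int btmp_find_pack(const struct btmp_entry *e, size_t nr,
			  uint32_t pos)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (pos >= e[i].bitmap_pos &&
		    pos - e[i].bitmap_pos < e[i].bitmap_nr)
			return (int)i;
	}
	return -1;
}
```

With packs of 10, 15, and 20 objects (the "a", "b", "c" example from the
documentation added below), `btmp_layout()` produces the regions
(0, 10), (10, 15), and (25, 20), so bit 24 maps to pack "b" and bit 25
to pack "c". Only the first pack's region starts at bit zero, which is
the assumption that later patches in this series relax.
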
 Documentation/gitformat-pack.txt | 76 ++++++++++++++++++++++++++++++++
 midx.c                           | 75 +++++++++++++++++++++++++++++--
 midx.h                           |  5 +++
 pack-bitmap.h                    |  9 ++++
 t/helper/test-read-midx.c        | 30 ++++++++++++-
 t/t5319-multi-pack-index.sh      | 35 +++++++++++++++
 6 files changed, 226 insertions(+), 4 deletions(-)

diff --git a/Documentation/gitformat-pack.txt b/Documentation/gitformat-pack.txt
index 9fcb29a9c8..d6ae229be5 100644
--- a/Documentation/gitformat-pack.txt
+++ b/Documentation/gitformat-pack.txt
@@ -396,6 +396,15 @@ CHUNK DATA:
 	    is padded at the end with between 0 and 3 NUL bytes to make the
 	    chunk size a multiple of 4 bytes.
 
+	Bitmapped Packfiles (ID: {'B', 'T', 'M', 'P'})
+	    Stores a table of two 4-byte unsigned integers in network order.
+	    Each table entry corresponds to a single pack (in the order that
+	    they appear above in the `PNAM` chunk). The values for each table
+	    entry are as follows:
+	    - The first bit position (in pseudo-pack order, see below) to
+	      contain an object from that pack.
+	    - The number of bits whose objects are selected from that pack.
+
 	OID Fanout (ID: {'O', 'I', 'D', 'F'})
 	    The ith entry, F[i], stores the number of OIDs with first
 	    byte at most i. Thus F[255] stores the total
@@ -509,6 +518,73 @@ packs arranged in MIDX order (with the preferred pack coming first).
 The MIDX's reverse index is stored in the optional 'RIDX' chunk within
 the MIDX itself.
 
+=== `BTMP` chunk
+
+The Bitmapped Packfiles (`BTMP`) chunk encodes additional information
+about the objects in the multi-pack index's reachability bitmap. Recall
+that objects from the MIDX are arranged in "pseudo-pack" order (see
+above) for reachability bitmaps.
+
+From the example above, suppose we have packs "a", "b", and "c", with
+10, 15, and 20 objects, respectively. In pseudo-pack order, those would
+be arranged as follows:
+
+    |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
+
+When working with single-pack bitmaps (or, equivalently, multi-pack
+reachability bitmaps with a preferred pack), linkgit:git-pack-objects[1]
+performs ``verbatim'' reuse, attempting to reuse chunks of the bitmapped
+or preferred packfile instead of adding objects to the packing list.
+
+When a chunk of bytes is reused from an existing pack, any objects
+contained therein do not need to be added to the packing list, saving
+memory and CPU time. But a chunk from an existing packfile can only be
+reused when the following conditions are met:
+
+  - The chunk contains only objects which were requested by the caller
+    (i.e. does not contain any objects which the caller didn't ask for
+    explicitly or implicitly).
+
+  - All objects stored in non-thin packs as offset- or reference-deltas
+    also include their base object in the resulting pack.
+
+The `BTMP` chunk encodes the necessary information in order to implement
+multi-pack reuse over a set of packfiles as described above.
+Specifically, the `BTMP` chunk encodes three pieces of information (all
+32-bit unsigned integers in network byte-order) for each packfile `p`
+that is stored in the MIDX, as follows:
+
+`bitmap_pos`:: The first bit position (in pseudo-pack order) in the
+  multi-pack index's reachability bitmap occupied by an object from `p`.
+
+`bitmap_nr`:: The number of bit positions (including the one at
+  `bitmap_pos`) that encode objects from that pack `p`.
+
+For example, the `BTMP` chunk corresponding to the above example (with
+packs ``a'', ``b'', and ``c'') would look like:
+
+[cols="1,2,2"]
+|===
+| |`bitmap_pos` |`bitmap_nr`
+
+|packfile ``a''
+|`0`
+|`10`
+
+|packfile ``b''
+|`10`
+|`15`
+
+|packfile ``c''
+|`25`
+|`20`
+|===
+
+With this information in place, we can treat each packfile as
+individually reusable in the same fashion as verbatim pack reuse is
+performed on individual packs prior to the implementation of the `BTMP`
+chunk.
+
 == cruft packs
 
 The cruft packs feature offer an alternative to Git's traditional mechanism of
diff --git a/midx.c b/midx.c
index 8dba67ddbe..de25612b0c 100644
--- a/midx.c
+++ b/midx.c
@@ -33,6 +33,7 @@
 
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_BITMAPPEDPACKS 0x42544d50 /* "BTMP" */
 #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
@@ -41,6 +42,7 @@
 #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
 #define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
 #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
+#define MIDX_CHUNK_BITMAPPED_PACKS_WIDTH (2 * sizeof(uint32_t))
 #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
 #define PACK_EXPIRED UINT_MAX
@@ -193,6 +195,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
 		   &m->chunk_large_offsets_len);
+	pair_chunk(cf, MIDX_CHUNKID_BITMAPPEDPACKS,
+		   (const unsigned char **)&m->chunk_bitmapped_packs,
+		   &m->chunk_bitmapped_packs_len);
 
 	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
 		pair_chunk(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex,
@@ -286,6 +291,26 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t
 	return 0;
 }
 
+int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
+		       struct bitmapped_pack *bp, uint32_t pack_int_id)
+{
+	if (!m->chunk_bitmapped_packs)
+		return error(_("MIDX does not contain the BTMP chunk"));
+
+	if (prepare_midx_pack(r, m, pack_int_id))
+		return error(_("could not load bitmapped pack %"PRIu32), pack_int_id);
+
+	bp->p = m->packs[pack_int_id];
+	bp->bitmap_pos = get_be32((char *)m->chunk_bitmapped_packs +
+				  MIDX_CHUNK_BITMAPPED_PACKS_WIDTH * pack_int_id);
+	bp->bitmap_nr = get_be32((char *)m->chunk_bitmapped_packs +
+				 MIDX_CHUNK_BITMAPPED_PACKS_WIDTH * pack_int_id +
+				 sizeof(uint32_t));
+	bp->pack_int_id = pack_int_id;
+
+	return 0;
+}
+
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result)
 {
 	return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
@@ -468,10 +493,16 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+#define BITMAP_POS_UNKNOWN (~((uint32_t)0))
+
 struct pack_info {
 	uint32_t orig_pack_int_id;
 	char *pack_name;
 	struct packed_git *p;
+
+	uint32_t bitmap_pos;
+	uint32_t bitmap_nr;
+
 	unsigned expired : 1;
 };
 
@@ -484,6 +515,7 @@ static void fill_pack_info(struct pack_info *info,
 	info->orig_pack_int_id = orig_pack_int_id;
 	info->pack_name = xstrdup(pack_name);
 	info->p = p;
+	info->bitmap_pos = BITMAP_POS_UNKNOWN;
 }
 
 static int pack_info_compare(const void *_a, const void *_b)
@@ -824,6 +856,26 @@ static int write_midx_pack_names(struct hashfile *f, void *data)
 	return 0;
 }
 
+static int write_midx_bitmapped_packs(struct hashfile *f, void *data)
+{
+	struct write_midx_context *ctx = data;
+	size_t i;
+
+	for (i = 0; i < ctx->nr; i++) {
+		struct pack_info *pack = &ctx->info[i];
+		if (pack->expired)
+			continue;
+
+		if (pack->bitmap_pos == BITMAP_POS_UNKNOWN && pack->bitmap_nr)
+			BUG("pack '%s' has no bitmap position, but has %"PRIu32" bitmapped object(s)",
+			    pack->pack_name, pack->bitmap_nr);
+
+		hashwrite_be32(f, pack->bitmap_pos);
+		hashwrite_be32(f, pack->bitmap_nr);
+	}
+	return 0;
+}
+
 static int write_midx_oid_fanout(struct hashfile *f,
 				 void *data)
 {
@@ -991,8 +1043,19 @@ static uint32_t *midx_pack_order(struct write_midx_context *ctx)
 	QSORT(data, ctx->entries_nr, midx_pack_order_cmp);
 
 	ALLOC_ARRAY(pack_order, ctx->entries_nr);
-	for (i = 0; i < ctx->entries_nr; i++)
+	for (i = 0; i < ctx->entries_nr; i++) {
+		struct pack_midx_entry *e = &ctx->entries[data[i].nr];
+		struct pack_info *pack = &ctx->info[ctx->pack_perm[e->pack_int_id]];
+		if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
+			pack->bitmap_pos = i;
+		pack->bitmap_nr++;
 		pack_order[i] = data[i].nr;
+	}
+	for (i = 0; i < ctx->nr; i++) {
+		struct pack_info *pack = &ctx->info[ctx->pack_perm[i]];
+		if (pack->bitmap_pos == BITMAP_POS_UNKNOWN)
+			pack->bitmap_pos = 0;
+	}
 	free(data);
 
 	trace2_region_leave("midx", "midx_pack_order", the_repository);
@@ -1293,6 +1356,7 @@ static int write_midx_internal(const char *object_dir,
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
+	int bitmapped_packs_concat_len = 0;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -1505,8 +1569,10 @@ static int write_midx_internal(const char *object_dir,
 	}
 
 	for (i = 0; i < ctx.nr; i++) {
-		if (!ctx.info[i].expired)
-			pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
+		if (ctx.info[i].expired)
+			continue;
+		pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
+		bitmapped_packs_concat_len += 2 * sizeof(uint32_t);
 	}
 
 	/* Check that the preferred pack wasn't expired (if given). */
@@ -1566,6 +1632,9 @@ static int write_midx_internal(const char *object_dir,
 		add_chunk(cf, MIDX_CHUNKID_REVINDEX,
 			  st_mult(ctx.entries_nr, sizeof(uint32_t)),
 			  write_midx_revindex);
+		add_chunk(cf, MIDX_CHUNKID_BITMAPPEDPACKS,
+			  bitmapped_packs_concat_len,
+			  write_midx_bitmapped_packs);
 	}
 
 	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
diff --git a/midx.h b/midx.h
index a5d98919c8..b404235db5 100644
--- a/midx.h
+++ b/midx.h
@@ -7,6 +7,7 @@
 struct object_id;
 struct pack_entry;
 struct repository;
+struct bitmapped_pack;
 
 #define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
 #define GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP \
@@ -33,6 +34,8 @@ struct multi_pack_index {
 
 	const unsigned char *chunk_pack_names;
 	size_t chunk_pack_names_len;
+	const uint32_t *chunk_bitmapped_packs;
+	size_t chunk_bitmapped_packs_len;
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_object_offsets;
@@ -58,6 +61,8 @@ void get_midx_rev_filename(struct strbuf *out, struct multi_pack_index *m);
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
+int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
+		       struct bitmapped_pack *bp, uint32_t pack_int_id);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
 off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos);
 uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 5273a6a019..b68b213388 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -52,6 +52,15 @@ typedef int (*show_reachable_fn)(
 
 struct bitmap_index;
 
+struct bitmapped_pack {
+	struct packed_git *p;
+
+	uint32_t bitmap_pos;
+	uint32_t bitmap_nr;
+
+	uint32_t pack_int_id; /* MIDX only */
+};
+
 struct bitmap_index *prepare_bitmap_git(struct repository *r);
 struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx);
 void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index e9a444ddba..e48557aba1 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -100,10 +100,36 @@ static int read_midx_preferred_pack(const char *object_dir)
 	return 0;
 }
 
+static int read_midx_bitmapped_packs(const char *object_dir)
+{
+	struct multi_pack_index *midx = NULL;
+	struct bitmapped_pack pack;
+	uint32_t i;
+
+	setup_git_directory();
+
+	midx = load_multi_pack_index(object_dir, 1);
+	if (!midx)
+		return 1;
+
+	for (i = 0; i < midx->num_packs; i++) {
+		if (nth_bitmapped_pack(the_repository, midx, &pack, i) < 0)
+			return 1;
+
+		printf("%s\n", pack_basename(pack.p));
+		printf("  bitmap_pos: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_pos);
+		printf("  bitmap_nr: %"PRIuMAX"\n", (uintmax_t)pack.bitmap_nr);
+	}
+
+	close_midx(midx);
+
+	return 0;
+}
+
 int cmd__read_midx(int argc, const char **argv)
 {
 	if (!(argc == 2 || argc == 3))
-		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
+		usage("read-midx [--show-objects|--checksum|--preferred-pack|--bitmap] <object-dir>");
 
 	if (!strcmp(argv[1], "--show-objects"))
 		return read_midx_file(argv[2], 1);
@@ -111,5 +137,7 @@ int cmd__read_midx(int argc, const char **argv)
 		return read_midx_checksum(argv[2]);
 	else if (!strcmp(argv[1], "--preferred-pack"))
 		return read_midx_preferred_pack(argv[2]);
+	else if (!strcmp(argv[1], "--bitmap"))
+		return read_midx_bitmapped_packs(argv[2]);
 	return read_midx_file(argv[1], 0);
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index c20aafe99a..dd09134db0 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1171,4 +1171,39 @@ test_expect_success 'reader notices out-of-bounds fanout' '
 	test_cmp expect err
 '
 
+test_expect_success 'bitmapped packs are stored via the BTMP chunk' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		for i in 1 2 3 4 5
+		do
+			test_commit "$i" &&
+			git repack -d || return 1
+		done &&
+
+		find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename |
+		sort >packs &&
+
+		git multi-pack-index write --stdin-packs <packs &&
+		test_must_fail test-tool read-midx --bitmap $objdir 2>err &&
+		cat >expect <<-\EOF &&
+		error: MIDX does not contain the BTMP chunk
+		EOF
+		test_cmp expect err &&
+
+		git multi-pack-index write --stdin-packs --bitmap \
+			--preferred-pack="$(head -n1 <packs)" <packs  &&
+		test-tool read-midx --bitmap $objdir >actual &&
+		for i in $(test_seq $(wc -l <packs))
+		do
+			sed -ne "${i}s/\.idx$/\.pack/p" packs &&
+			echo "  bitmap_pos: $((($i - 1) * 3))" &&
+			echo "  bitmap_nr: 3" || return 1
+		done >expect &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH v2 06/26] midx: implement `midx_locate_pack()`
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (4 preceding siblings ...)
  2023-12-14 22:23   ` [PATCH v2 05/26] midx: implement `BTMP` chunk Taylor Blau
@ 2023-12-14 22:23   ` Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 07/26] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
                     ` (20 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The multi-pack index API exposes a `midx_contains_pack()` function that
takes in a string ending in either ".idx" or ".pack" and returns whether
or not the MIDX contains a given pack corresponding to that string.

There is no corresponding function to locate the position of a pack
within the MIDX's pack order (sorted lexically by pack filename).

We could add an optional out parameter to `midx_contains_pack()` that is
filled out with the pack's position when the parameter is non-NULL. To
minimize the amount of fallout from this change, instead introduce a new
function by renaming `midx_contains_pack()` to `midx_locate_pack()`,
adding that output parameter, and then reimplementing
`midx_contains_pack()` in terms of it.
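
As a rough illustration of that shape (not the code in this patch), the
sketch below performs a binary search over a sorted array of names and
reports the match position through an optional out parameter. The
`toy_midx` struct and the `toy_*` functions are hypothetical stand-ins,
and plain strcmp() is assumed in place of the real ".idx"/".pack"-aware
comparison:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the MIDX: just a sorted array of pack names.
 * The real `struct multi_pack_index` carries much more state.
 */
struct toy_midx {
	const char **pack_names; /* sorted lexically */
	uint32_t num_packs;
};

/* Returns 1 and (optionally) stores the index in *pos if found, else 0. */
static int toy_locate_pack(const struct toy_midx *m, const char *name,
			   uint32_t *pos)
{
	uint32_t first = 0, last = m->num_packs;

	assert(m->pack_names || !m->num_packs);

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = strcmp(name, m->pack_names[mid]);

		if (!cmp) {
			if (pos)
				*pos = mid; /* only written when requested */
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}
	return 0;
}

/* The "contains" query is the "locate" query with the position discarded. */
static int toy_contains_pack(const struct toy_midx *m, const char *name)
{
	return toy_locate_pack(m, name, NULL);
}
```

Keeping the old name as a thin wrapper means existing callers need no
changes, while new callers that want the position pass a non-NULL `pos`.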

Future patches will make use of this new function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 13 +++++++++++--
 midx.h |  5 ++++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index de25612b0c..beaf0c0de4 100644
--- a/midx.c
+++ b/midx.c
@@ -428,7 +428,8 @@ static int cmp_idx_or_pack_name(const char *idx_or_pack_name,
 	return strcmp(idx_or_pack_name, idx_name);
 }
 
-int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
+int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
+		     uint32_t *pos)
 {
 	uint32_t first = 0, last = m->num_packs;
 
@@ -439,8 +440,11 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 
 		current = m->pack_names[mid];
 		cmp = cmp_idx_or_pack_name(idx_or_pack_name, current);
-		if (!cmp)
+		if (!cmp) {
+			if (pos)
+				*pos = mid;
 			return 1;
+		}
 		if (cmp > 0) {
 			first = mid + 1;
 			continue;
@@ -451,6 +455,11 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 	return 0;
 }
 
+int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
+{
+	return midx_locate_pack(m, idx_or_pack_name, NULL);
+}
+
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local)
 {
 	struct multi_pack_index *m;
diff --git a/midx.h b/midx.h
index b404235db5..89c5aa637e 100644
--- a/midx.h
+++ b/midx.h
@@ -70,7 +70,10 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
 					struct multi_pack_index *m,
 					uint32_t n);
 int fill_midx_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
-int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name);
+int midx_contains_pack(struct multi_pack_index *m,
+		       const char *idx_or_pack_name);
+int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
+		     uint32_t *pos);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
 /*
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH v2 07/26] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (5 preceding siblings ...)
  2023-12-14 22:23   ` [PATCH v2 06/26] midx: implement `midx_locate_pack()` Taylor Blau
@ 2023-12-14 22:23   ` Taylor Blau
  2023-12-14 22:23   ` [PATCH v2 08/26] ewah: implement `bitmap_is_empty()` Taylor Blau
                     ` (19 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

When trying to assemble a pack with bitmaps using `--use-bitmap-index`,
`pack-objects` asks the pack-bitmap machinery for a bitmap which
indicates the set of objects we can "reuse" verbatim from disk.

This set is roughly comprised of: a prefix of objects in the bitmapped
pack (or preferred pack, in the case of a multi-pack reachability
bitmap), plus any other objects not included in the prefix, excluding
any deltas whose base we are not sending in the resulting pack.

The pack-bitmap machinery is responsible for computing this bitmap, and
does so with the following functions:

  - reuse_partial_packfile_from_bitmap()
  - try_partial_reuse()

In the existing implementation, the first function is responsible for
(a) marking the prefix of objects in the reusable pack, and then (b)
calling try_partial_reuse() on any remaining objects to ensure that they
are also reusable (and removing them from the bitmapped set if they are
not).

Likewise, the `try_partial_reuse()` function is responsible for checking
whether an isolated object (that is, an object from the bitmapped
pack/preferred pack not contained in the prefix from earlier) may be
reused, i.e. that it isn't a delta of an object that we are not sending
in the resulting pack.

These functions are based on two core assumptions, which we will unwind
in this and the following commits:

  1. There is only a single pack from the bitmap which is eligible for
     verbatim pack-reuse. For single-pack bitmaps, this is trivially the
     bitmapped pack. For multi-pack bitmaps, this is (currently) the
     MIDX's preferred pack.

  2. The pack eligible for reuse has its first object in bit position 0,
     and all objects from that pack follow in pack-order from that first
     bit position.

In order to perform verbatim pack reuse over multiple packs, we must
unwind these two assumptions. Most notably, in order to reuse bits from
a given packfile, we need to know the first bit position occupied by
an object from that packfile. To propagate this information around, pass
a `struct bitmapped_pack *` anywhere we previously passed a `struct
packed_git *`, since the former contains the bitmap position we're
interested in (as well as a pointer to the latter).
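
To make the offset arithmetic concrete, here is a minimal sketch of
translating a pack-relative object position into a bit position in the
shared bitmap. The `toy_bitmapped_pack` struct and helper names are
hypothetical simplifications of `struct bitmapped_pack` and the ewah
bitmap API, and 64-bit words are assumed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BITS_PER_WORD 64

/* Hypothetical, simplified counterpart of `struct bitmapped_pack`. */
struct toy_bitmapped_pack {
	uint32_t bitmap_pos; /* first bit belonging to this pack */
	uint32_t bitmap_nr;  /* number of bits belonging to this pack */
};

/* Translate a pack-relative object position into a global bit position. */
static size_t toy_global_bit(const struct toy_bitmapped_pack *pack,
			     uint32_t pos)
{
	assert(pos < pack->bitmap_nr);
	return (size_t)pack->bitmap_pos + pos;
}

/* Mark one bit in a plain array of 64-bit words. */
static void toy_set_bit(uint64_t *words, size_t bit)
{
	words[bit / TOY_BITS_PER_WORD] |=
		(uint64_t)1 << (bit % TOY_BITS_PER_WORD);
}

/* Read one bit back from the same array. */
static int toy_get_bit(const uint64_t *words, size_t bit)
{
	return (words[bit / TOY_BITS_PER_WORD] >>
		(bit % TOY_BITS_PER_WORD)) & 1;
}
```

For a pack whose `bitmap_pos` is 10, the object at pack position 3 maps to
bit 13. With a single reusable pack at `bitmap_pos` 0 the translation is a
no-op, which is why the existing code could get away without it.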

As an additional step, factor out a sub-routine from the main
`reuse_partial_packfile_from_bitmap()` function, called
`reuse_partial_packfile_from_bitmap_1()`. This new function will be
responsible for figuring out which objects may be reused from a single
pack, and the existing function will dispatch multiple calls to its new
helper function for each reusable pack.

Consequently, `reuse_partial_packfile_from_bitmap()` will now maintain
an array of reusable packs instead of a single such pack. We currently
expect that array to have only a single element, so this awkward state
is short-lived. It will serve as useful scaffolding in subsequent
commits as we begin to work towards enabling multi-pack reuse.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 118 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 87 insertions(+), 31 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index d2f1306960..d64a80c30c 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1836,7 +1836,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
  * -1 means "stop trying further objects"; 0 means we may or may not have
  * reused, but you can keep feeding bits.
  */
-static int try_partial_reuse(struct packed_git *pack,
+static int try_partial_reuse(struct bitmapped_pack *pack,
 			     size_t pos,
 			     struct bitmap *reuse,
 			     struct pack_window **w_curs)
@@ -1868,11 +1868,11 @@ static int try_partial_reuse(struct packed_git *pack,
 	 * preferred pack precede all bits from other packs.
 	 */
 
-	if (pos >= pack->num_objects)
+	if (pos >= pack->p->num_objects)
 		return -1; /* not actually in the pack or MIDX preferred pack */
 
-	offset = delta_obj_offset = pack_pos_to_offset(pack, pos);
-	type = unpack_object_header(pack, w_curs, &offset, &size);
+	offset = delta_obj_offset = pack_pos_to_offset(pack->p, pos);
+	type = unpack_object_header(pack->p, w_curs, &offset, &size);
 	if (type < 0)
 		return -1; /* broken packfile, punt */
 
@@ -1888,11 +1888,11 @@ static int try_partial_reuse(struct packed_git *pack,
 		 * and the normal slow path will complain about it in
 		 * more detail.
 		 */
-		base_offset = get_delta_base(pack, w_curs, &offset, type,
+		base_offset = get_delta_base(pack->p, w_curs, &offset, type,
 					     delta_obj_offset);
 		if (!base_offset)
 			return 0;
-		if (offset_to_pack_pos(pack, base_offset, &base_pos) < 0)
+		if (offset_to_pack_pos(pack->p, base_offset, &base_pos) < 0)
 			return 0;
 
 		/*
@@ -1915,14 +1915,14 @@ static int try_partial_reuse(struct packed_git *pack,
 		 * to REF_DELTA on the fly. Better to just let the normal
 		 * object_entry code path handle it.
 		 */
-		if (!bitmap_get(reuse, base_pos))
+		if (!bitmap_get(reuse, pack->bitmap_pos + base_pos))
 			return 0;
 	}
 
 	/*
 	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
-	bitmap_set(reuse, pos);
+	bitmap_set(reuse, pack->bitmap_pos + pos);
 	return 0;
 }
 
@@ -1934,29 +1934,13 @@ uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
 	return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
 }
 
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-				       struct packed_git **packfile_out,
-				       uint32_t *entries,
-				       struct bitmap **reuse_out)
+static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git,
+						 struct bitmapped_pack *pack,
+						 struct bitmap *reuse)
 {
-	struct repository *r = the_repository;
-	struct packed_git *pack;
 	struct bitmap *result = bitmap_git->result;
-	struct bitmap *reuse;
 	struct pack_window *w_curs = NULL;
 	size_t i = 0;
-	uint32_t offset;
-	uint32_t objects_nr;
-
-	assert(result);
-
-	load_reverse_index(r, bitmap_git);
-
-	if (bitmap_is_midx(bitmap_git))
-		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
-	else
-		pack = bitmap_git->pack;
-	objects_nr = pack->num_objects;
 
 	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
 		i++;
@@ -1969,15 +1953,15 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * we use it instead of another pack. In single-pack bitmaps, the choice
 	 * is made for us.
 	 */
-	if (i > objects_nr / BITS_IN_EWORD)
-		i = objects_nr / BITS_IN_EWORD;
+	if (i > pack->p->num_objects / BITS_IN_EWORD)
+		i = pack->p->num_objects / BITS_IN_EWORD;
 
-	reuse = bitmap_word_alloc(i);
 	memset(reuse->words, 0xFF, i * sizeof(eword_t));
 
 	for (; i < result->word_alloc; ++i) {
 		eword_t word = result->words[i];
 		size_t pos = (i * BITS_IN_EWORD);
+		size_t offset;
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
 			if ((word >> offset) == 0)
@@ -2002,6 +1986,78 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 
 done:
 	unuse_pack(&w_curs);
+}
+
+static int bitmapped_pack_cmp(const void *va, const void *vb)
+{
+	const struct bitmapped_pack *a = va;
+	const struct bitmapped_pack *b = vb;
+
+	if (a->bitmap_pos < b->bitmap_pos)
+		return -1;
+	if (a->bitmap_pos > b->bitmap_pos)
+		return 1;
+	return 0;
+}
+
+int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+				       struct packed_git **packfile_out,
+				       uint32_t *entries,
+				       struct bitmap **reuse_out)
+{
+	struct repository *r = the_repository;
+	struct bitmapped_pack *packs = NULL;
+	struct bitmap *result = bitmap_git->result;
+	struct bitmap *reuse;
+	size_t i;
+	size_t packs_nr = 0, packs_alloc = 0;
+	size_t word_alloc;
+	uint32_t objects_nr = 0;
+
+	assert(result);
+
+	load_reverse_index(r, bitmap_git);
+
+	if (bitmap_is_midx(bitmap_git)) {
+		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
+			struct bitmapped_pack pack;
+			if (nth_bitmapped_pack(r, bitmap_git->midx, &pack, i) < 0) {
+				warning(_("unable to load pack: '%s', disabling pack-reuse"),
+					bitmap_git->midx->pack_names[i]);
+				free(packs);
+				return -1;
+			}
+			if (!pack.bitmap_nr)
+				continue; /* no objects from this pack */
+			if (pack.bitmap_pos)
+				continue; /* not preferred pack */
+
+			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+			memcpy(&packs[packs_nr++], &pack, sizeof(pack));
+
+			objects_nr += pack.p->num_objects;
+		}
+
+		QSORT(packs, packs_nr, bitmapped_pack_cmp);
+	} else {
+		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
+
+		packs[packs_nr].p = bitmap_git->pack;
+		packs[packs_nr].bitmap_pos = 0;
+		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
+
+		objects_nr = packs[packs_nr++].p->num_objects;
+	}
+
+	word_alloc = objects_nr / BITS_IN_EWORD;
+	if (objects_nr % BITS_IN_EWORD)
+		word_alloc++;
+	reuse = bitmap_word_alloc(word_alloc);
+
+	if (packs_nr != 1)
+		BUG("pack reuse not yet implemented for multiple packs");
+
+	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
 
 	*entries = bitmap_popcount(reuse);
 	if (!*entries) {
@@ -2014,7 +2070,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * need to be handled separately.
 	 */
 	bitmap_and_not(result, reuse);
-	*packfile_out = pack;
+	*packfile_out = packs[0].p;
 	*reuse_out = reuse;
 	return 0;
 }
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH v2 08/26] ewah: implement `bitmap_is_empty()`
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (6 preceding siblings ...)
  2023-12-14 22:23   ` [PATCH v2 07/26] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
@ 2023-12-14 22:23   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 09/26] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
                     ` (18 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:23 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

In a future commit, we will want to check whether or not a bitmap has
any bits set in any of its words. The best way to do this (prior to the
existence of this patch) is to call `bitmap_popcount()` and check
whether the result is non-zero.

But this is semi-wasteful, since we do not need to know the exact number
of bits set, only whether or not there is at least one of them.

Implement a new helper function to check just that.
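
The difference between the two approaches can be sketched as follows. The
`toy_bitmap` struct and both functions are hypothetical simplifications
(64-bit words, no compression), not the ewah code itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified bitmap: an array of 64-bit words. */
struct toy_bitmap {
	const uint64_t *words;
	size_t word_alloc;
};

/* Counts every set bit; always visits every word. */
static size_t toy_popcount(const struct toy_bitmap *b)
{
	size_t i, count = 0;

	assert(b->words || !b->word_alloc);

	for (i = 0; i < b->word_alloc; i++) {
		uint64_t w = b->words[i];
		while (w) {
			w &= w - 1; /* clear the lowest set bit */
			count++;
		}
	}
	return count;
}

/* Stops at the first non-zero word. */
static int toy_is_empty(const struct toy_bitmap *b)
{
	size_t i;

	for (i = 0; i < b->word_alloc; i++)
		if (b->words[i])
			return 0;
	return 1;
}
```

Both answer the "any bits set?" question, but the emptiness check can
return as soon as it sees a non-zero word and never has to count bits.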

Suggested-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 ewah/bitmap.c | 9 +++++++++
 ewah/ewok.h   | 1 +
 2 files changed, 10 insertions(+)

diff --git a/ewah/bitmap.c b/ewah/bitmap.c
index 7b525b1ecd..ac7e0af622 100644
--- a/ewah/bitmap.c
+++ b/ewah/bitmap.c
@@ -169,6 +169,15 @@ size_t bitmap_popcount(struct bitmap *self)
 	return count;
 }
 
+int bitmap_is_empty(struct bitmap *self)
+{
+	size_t i;
+	for (i = 0; i < self->word_alloc; i++)
+		if (self->words[i])
+			return 0;
+	return 1;
+}
+
 int bitmap_equals(struct bitmap *self, struct bitmap *other)
 {
 	struct bitmap *big, *small;
diff --git a/ewah/ewok.h b/ewah/ewok.h
index 7eb8b9b630..c11d76c6f3 100644
--- a/ewah/ewok.h
+++ b/ewah/ewok.h
@@ -189,5 +189,6 @@ void bitmap_or_ewah(struct bitmap *self, struct ewah_bitmap *other);
 void bitmap_or(struct bitmap *self, const struct bitmap *other);
 
 size_t bitmap_popcount(struct bitmap *self);
+int bitmap_is_empty(struct bitmap *self);
 
 #endif
-- 
2.43.0.102.ga31d690331.dirty


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH v2 09/26] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (7 preceding siblings ...)
  2023-12-14 22:23   ` [PATCH v2 08/26] ewah: implement `bitmap_is_empty()` Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 10/26] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
                     ` (17 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The signature of `reuse_partial_packfile_from_bitmap()` currently takes
in a bitmap, as well as three output parameters (filled through
pointers, and passed as arguments), and also returns an integer result.

The output parameters are filled out with: (a) the packfile used for
pack-reuse, (b) the number of objects from that pack that we can reuse,
and (c) a bitmap indicating which objects we can reuse. The return value
is either -1 (when there are no objects to reuse), or 0 (when there is
at least one object to reuse).

Some of these parameters are redundant. Notably, we can infer from the
bitmap how many objects are reused by calling bitmap_popcount(). And we
can similarly compute the return value based on that number as well.

As such, clean up the signature of this function to drop the "*entries"
parameter, as well as the int return value, since the single caller of
this function can infer these values itself.
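
A minimal sketch of the resulting calling convention, under simplifying
assumptions: the reuse bitmap is a single 64-bit word, and `toy_reuse`,
`toy_compute_reuse()`, and `toy_reused_objects()` are hypothetical names
rather than functions from this series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result of a reuse computation: one word of reusable bits. */
struct toy_reuse {
	uint64_t bits;
};

/*
 * Leaves *out untouched when nothing is reusable, so "no reuse" is
 * communicated without a return value.
 */
static void toy_compute_reuse(uint64_t wanted, struct toy_reuse *storage,
			      struct toy_reuse **out)
{
	if (!wanted)
		return;
	storage->bits = wanted;
	*out = storage;
}

/* Caller side: derive the object count from the bitmap itself. */
static size_t toy_reused_objects(const struct toy_reuse *reuse)
{
	size_t n = 0;
	uint64_t w;

	if (!reuse)
		return 0;
	for (w = reuse->bits; w; w &= w - 1)
		n++;
	assert(n); /* a non-NULL result is expected to be non-empty */
	return n;
}
```

The caller tests whether the out parameter was filled in and, if so,
counts the bits on its own, so neither a separate "entries" out parameter
nor an integer return value is needed.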

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 pack-bitmap.c          | 16 +++++++---------
 pack-bitmap.h          |  7 +++----
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 321d7effb0..c3df6d9657 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3943,13 +3943,15 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
 		return -1;
 
-	if (pack_options_allow_reuse() &&
-	    !reuse_partial_packfile_from_bitmap(
-			bitmap_git,
-			&reuse_packfile,
-			&reuse_packfile_objects,
-			&reuse_packfile_bitmap)) {
-		assert(reuse_packfile_objects);
+	if (pack_options_allow_reuse())
+		reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
+						   &reuse_packfile_bitmap);
+
+	if (reuse_packfile) {
+		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
+		if (!reuse_packfile_objects)
+			BUG("expected non-empty reuse bitmap");
+
 		nr_result += reuse_packfile_objects;
 		nr_seen += reuse_packfile_objects;
 		display_progress(progress_state, nr_seen);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d64a80c30c..c75a83e9cc 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2000,10 +2000,9 @@ static int bitmapped_pack_cmp(const void *va, const void *vb)
 	return 0;
 }
 
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-				       struct packed_git **packfile_out,
-				       uint32_t *entries,
-				       struct bitmap **reuse_out)
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+					struct packed_git **packfile_out,
+					struct bitmap **reuse_out)
 {
 	struct repository *r = the_repository;
 	struct bitmapped_pack *packs = NULL;
@@ -2025,7 +2024,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				warning(_("unable to load pack: '%s', disabling pack-reuse"),
 					bitmap_git->midx->pack_names[i]);
 				free(packs);
-				return -1;
+				return;
 			}
 			if (!pack.bitmap_nr)
 				continue; /* no objects from this pack */
@@ -2059,10 +2058,10 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 
 	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
 
-	*entries = bitmap_popcount(reuse);
-	if (!*entries) {
+	if (bitmap_is_empty(reuse)) {
+		free(packs);
 		bitmap_free(reuse);
-		return -1;
+		return;
 	}
 
 	/*
@@ -2072,7 +2071,6 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	bitmap_and_not(result, reuse);
 	*packfile_out = packs[0].p;
 	*reuse_out = reuse;
-	return 0;
 }
 
 int bitmap_walk_contains(struct bitmap_index *bitmap_git,
diff --git a/pack-bitmap.h b/pack-bitmap.h
index b68b213388..ab3fdcde6b 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -78,10 +78,9 @@ int test_bitmap_hashes(struct repository *r);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects);
 uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
-				       struct packed_git **packfile,
-				       uint32_t *entries,
-				       struct bitmap **reuse_out);
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *,
+					struct packed_git **packfile,
+					struct bitmap **reuse_out);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
-- 
2.43.0.102.ga31d690331.dirty


* [PATCH v2 10/26] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (8 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 09/26] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 11/26] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
                     ` (16 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Further prepare for enabling verbatim pack-reuse over multiple packfiles
by changing the signature of reuse_partial_packfile_from_bitmap() to
populate an array of `struct bitmapped_pack *`'s instead of a pointer to
a single packfile.

Since the array we're filling out is sized dynamically[^1], add an
additional `size_t *` parameter which will hold the number of reusable
packs (equal to the number of elements in the array).

Note that since we still have not implemented true multi-pack reuse,
these changes aren't propagated out to the rest of the calling code in
builtin/pack-objects.c.

In the interim state, we expect that the array has a single element, and
we use that element to fill out the static `reuse_packfile` variable
(which is a bog-standard `struct packed_git *`). Future commits will
continue to push this change further out through the pack-objects code.

[^1]: That is, even though we know the number of packs which are
  candidates for pack-reuse, we do not know how many of those
  candidates we can actually reuse.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
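Not part of the patch: a minimal sketch of the "array plus count"
out-parameter convention described above, under assumed names.
`struct candidate`, `collect_reusable()` and the `usable` flag are
hypothetical stand-ins, not git APIs. The caller knows how many
candidates exist, but only learns how many were kept via `*nr_out`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pack that may or may not be reusable. */
struct candidate {
	int id;
	int usable;
};

/*
 * Copy the usable candidates into a freshly allocated array. On return,
 * "*out" points at the array (or is NULL if nothing was usable or the
 * allocation failed) and "*nr_out" holds its length. The caller owns
 * "*out" and must free() it.
 */
void collect_reusable(const struct candidate *all, size_t all_nr,
		      struct candidate **out, size_t *nr_out)
{
	struct candidate *kept;
	size_t i, nr = 0;

	assert(out && nr_out);
	*out = NULL;
	*nr_out = 0;

	if (!all_nr)
		return;
	kept = malloc(all_nr * sizeof(*kept));
	if (!kept)
		return;

	for (i = 0; i < all_nr; i++)
		if (all[i].usable)
			kept[nr++] = all[i];

	if (!nr) {
		free(kept);
		return;
	}
	*out = kept;
	*nr_out = nr;
}
```

A caller in the interim state described above would read only element
zero of the returned array and ignore the rest.
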
 builtin/pack-objects.c | 9 +++++++--
 pack-bitmap.c          | 6 ++++--
 pack-bitmap.h          | 5 +++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c3df6d9657..87e16636a8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3940,14 +3940,19 @@ static int pack_options_allow_reuse(void)
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
 {
+	struct bitmapped_pack *packs = NULL;
+	size_t packs_nr = 0;
+
 	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
 		return -1;
 
 	if (pack_options_allow_reuse())
-		reuse_partial_packfile_from_bitmap(bitmap_git, &reuse_packfile,
+		reuse_partial_packfile_from_bitmap(bitmap_git, &packs,
+						   &packs_nr,
 						   &reuse_packfile_bitmap);
 
-	if (reuse_packfile) {
+	if (packs) {
+		reuse_packfile = packs[0].p;
 		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
 		if (!reuse_packfile_objects)
 			BUG("expected non-empty reuse bitmap");
diff --git a/pack-bitmap.c b/pack-bitmap.c
index c75a83e9cc..4d5a484678 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2001,7 +2001,8 @@ static int bitmapped_pack_cmp(const void *va, const void *vb)
 }
 
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-					struct packed_git **packfile_out,
+					struct bitmapped_pack **packs_out,
+					size_t *packs_nr_out,
 					struct bitmap **reuse_out)
 {
 	struct repository *r = the_repository;
@@ -2069,7 +2070,8 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * need to be handled separately.
 	 */
 	bitmap_and_not(result, reuse);
-	*packfile_out = packs[0].p;
+	*packs_out = packs;
+	*packs_nr_out = packs_nr;
 	*reuse_out = reuse;
 }
 
diff --git a/pack-bitmap.h b/pack-bitmap.h
index ab3fdcde6b..7a12a2ce81 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -78,8 +78,9 @@ int test_bitmap_hashes(struct repository *r);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects);
 uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
-void reuse_partial_packfile_from_bitmap(struct bitmap_index *,
-					struct packed_git **packfile,
+void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+					struct bitmapped_pack **packs_out,
+					size_t *packs_nr_out,
 					struct bitmap **reuse_out);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
-- 
2.43.0.102.ga31d690331.dirty


* [PATCH v2 11/26] pack-objects: parameterize pack-reuse routines over a single pack
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (9 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 10/26] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 12/26] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
                     ` (15 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The routines pack-objects uses to perform verbatim pack-reuse are:

  - write_reused_pack_one()
  - write_reused_pack_verbatim()
  - write_reused_pack()

, all of which assume that there is exactly one packfile being reused:
the global variable `reuse_packfile`.

Prepare for reusing objects from multiple packs by making the reuse
packfile a parameter of each of the above functions, so that they can
later be called in a loop over multiple packfiles.

Note that we still have the global "reuse_packfile", but pass it through
each of the above functions' parameter lists, eliminating all but one
direct access (the top-level caller in `write_pack_file()`). Even after
this series, we will still have a global, but it will hold the array of
reusable packfiles, and we'll pass them one at a time to these functions
in a loop.

Note also that we will eventually need to pass a `bitmapped_pack`
instead of a `packed_git` in order to hold onto additional information
required for reuse (such as the bit position of the first object
belonging to that pack). But that change will be made in a future commit
so as to minimize the noise below as much as possible.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 87e16636a8..102fe9a4f8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1013,7 +1013,8 @@ static off_t find_reused_offset(off_t where)
 	return reused_chunks[lo-1].difference;
 }
 
-static void write_reused_pack_one(size_t pos, struct hashfile *out,
+static void write_reused_pack_one(struct packed_git *reuse_packfile,
+				  size_t pos, struct hashfile *out,
 				  struct pack_window **w_curs)
 {
 	off_t offset, next, cur;
@@ -1091,7 +1092,8 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
 	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
 }
 
-static size_t write_reused_pack_verbatim(struct hashfile *out,
+static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
+					 struct hashfile *out,
 					 struct pack_window **w_curs)
 {
 	size_t pos = 0;
@@ -1118,14 +1120,15 @@ static size_t write_reused_pack_verbatim(struct hashfile *out,
 	return pos;
 }
 
-static void write_reused_pack(struct hashfile *f)
+static void write_reused_pack(struct packed_git *reuse_packfile,
+			      struct hashfile *f)
 {
 	size_t i = 0;
 	uint32_t offset;
 	struct pack_window *w_curs = NULL;
 
 	if (allow_ofs_delta)
-		i = write_reused_pack_verbatim(f, &w_curs);
+		i = write_reused_pack_verbatim(reuse_packfile, f, &w_curs);
 
 	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
 		eword_t word = reuse_packfile_bitmap->words[i];
@@ -1141,7 +1144,8 @@ static void write_reused_pack(struct hashfile *f)
 			 * bitmaps. See comment in try_partial_reuse()
 			 * for why.
 			 */
-			write_reused_pack_one(pos + offset, f, &w_curs);
+			write_reused_pack_one(reuse_packfile, pos + offset, f,
+					      &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
@@ -1199,7 +1203,7 @@ static void write_pack_file(void)
 
 		if (reuse_packfile) {
 			assert(pack_to_stdout);
-			write_reused_pack(f);
+			write_reused_pack(reuse_packfile, f);
 			offset = hashfile_total(f);
 		}
 
-- 
2.43.0.102.ga31d690331.dirty


* [PATCH v2 12/26] pack-objects: keep track of `pack_start` for each reuse pack
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (10 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 11/26] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 13/26] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
                     ` (14 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

When reusing objects from a pack, we keep track of a set of one or more
`reused_chunk`s, each corresponding to a section of one or more objects
from a source pack that we are reusing. Each chunk contains two pieces
of information:

  - the offset of the first object in the source pack (relative to the
    beginning of the source pack)
  - the difference between that offset, and the corresponding offset in
    the pack we're generating

The purpose of keeping track of these is so that we can patch any
OFS_DELTAs that cross over a section of the reuse pack that we didn't
take.

For instance, consider a hypothetical pack as shown below:

                                                (chunk #2)
                                                __________...
                                               /
                                              /
      +--------+---------+-------------------+---------+
  ... | <base> | <other> |      (unused)     | <delta> | ...
      +--------+---------+-------------------+---------+
       \                /
        \______________/
           (chunk #1)

Suppose that we are sending objects "base", "other", and "delta", and
that the "delta" object is stored as an OFS_DELTA, and that its base is
"base". If we don't send any objects in the "(unused)" range, we can't
copy the delta'd object directly, since its delta offset includes a
range of the pack that we didn't copy, so we have to account for that
difference when patching and reassembling the delta.

In order to compute this value correctly, we need to know not only where
we are in the packfile we're assembling (with `hashfile_total(f)`) but
also the position of the first byte of the packfile that we are
currently reusing. Currently, this works just fine, since when reusing
only a single pack those two values are always identical (because
verbatim reuse is the first thing pack-objects does when enabled after
writing the pack header).

But when reusing multiple packs which have one or more gaps, we'll need
to account for these two values diverging.

Together, these two values allow us to compute each reused chunk's
offset difference relative to the start of the reused pack, as desired.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
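Not part of the patch: a worked sketch of the offset arithmetic
described in the commit message, using made-up byte offsets.
`chunk_difference()` and `patched_ofs_delta()` are hypothetical helper
names, not functions in builtin/pack-objects.c. The first mirrors the
expression passed to `record_reused_object()` below; the second assumes
that a delta's distance to its base is corrected by the change in
difference between the two chunks involved.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Difference between an object's offset in the source pack and its
 * offset in the output, measured from where this source pack's data
 * begins in the output ("pack_start").
 */
int64_t chunk_difference(int64_t src_offset, int64_t out_total,
			 int64_t pack_start)
{
	assert(out_total >= pack_start);
	return src_offset - (out_total - pack_start);
}

/*
 * New distance from a delta back to its base once any skipped bytes
 * between them are accounted for (assumed fixup rule, see above).
 */
int64_t patched_ofs_delta(int64_t delta_src, int64_t base_src,
			  int64_t delta_diff, int64_t base_diff)
{
	int64_t fixup = delta_diff - base_diff;

	assert(delta_src > base_src);
	return (delta_src - base_src) - fixup;
}
```

For example, take a second reuse pack with `pack_start` equal to 1000.
A base at source offset 12 written when the output total is 1012 has a
difference of 0. A delta at source offset 512 written when the output
total is 1212 has a difference of 300, so its original distance of 500
bytes becomes 200, which matches 1212 - 1012 in the output.
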
 builtin/pack-objects.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 102fe9a4f8..f51b86d99f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1015,6 +1015,7 @@ static off_t find_reused_offset(off_t where)
 
 static void write_reused_pack_one(struct packed_git *reuse_packfile,
 				  size_t pos, struct hashfile *out,
+				  off_t pack_start,
 				  struct pack_window **w_curs)
 {
 	off_t offset, next, cur;
@@ -1024,7 +1025,8 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 	offset = pack_pos_to_offset(reuse_packfile, pos);
 	next = pack_pos_to_offset(reuse_packfile, pos + 1);
 
-	record_reused_object(offset, offset - hashfile_total(out));
+	record_reused_object(offset,
+			     offset - (hashfile_total(out) - pack_start));
 
 	cur = offset;
 	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
@@ -1094,6 +1096,7 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 
 static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
 					 struct hashfile *out,
+					 off_t pack_start UNUSED,
 					 struct pack_window **w_curs)
 {
 	size_t pos = 0;
@@ -1125,10 +1128,12 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
 {
 	size_t i = 0;
 	uint32_t offset;
+	off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);
 	struct pack_window *w_curs = NULL;
 
 	if (allow_ofs_delta)
-		i = write_reused_pack_verbatim(reuse_packfile, f, &w_curs);
+		i = write_reused_pack_verbatim(reuse_packfile, f, pack_start,
+					       &w_curs);
 
 	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
 		eword_t word = reuse_packfile_bitmap->words[i];
@@ -1145,7 +1150,7 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
 			 * for why.
 			 */
 			write_reused_pack_one(reuse_packfile, pos + offset, f,
-					      &w_curs);
+					      pack_start, &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
-- 
2.43.0.102.ga31d690331.dirty


* [PATCH v2 13/26] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (11 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 12/26] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 14/26] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
                     ` (13 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Further prepare pack-objects to perform verbatim pack-reuse over
multiple packfiles by converting functions that take in a pointer to a
`struct packed_git` to instead take in a pointer to a `struct
bitmapped_pack`.

The additional information found in the bitmapped_pack struct (such as
the bit position corresponding to the beginning of the pack) will be
necessary in order to perform verbatim pack-reuse.

Note that we don't yet use any of the extra pieces of information
contained in the bitmapped_pack struct, so this step is merely
preparatory and does not introduce any functional changes.

Note further that we do not change the argument type to
write_reused_pack_one(). That function is responsible for copying
sections of the packfile directly and optionally patching any OFS_DELTAs
to account for not reusing sections of the packfile in between a delta
and its base.

As such, that function is (and should remain) oblivious to multi-pack
reuse, and does not require any of the extra pieces of information
stored in the bitmapped_pack struct.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f51b86d99f..07c849b5d4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -221,7 +221,8 @@ static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
 
-static struct packed_git *reuse_packfile;
+static struct bitmapped_pack *reuse_packfiles;
+static size_t reuse_packfiles_nr;
 static uint32_t reuse_packfile_objects;
 static struct bitmap *reuse_packfile_bitmap;
 
@@ -1094,7 +1095,7 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
 }
 
-static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
+static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
 					 struct hashfile *out,
 					 off_t pack_start UNUSED,
 					 struct pack_window **w_curs)
@@ -1109,13 +1110,13 @@ static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
 		off_t to_write;
 
 		written = (pos * BITS_IN_EWORD);
-		to_write = pack_pos_to_offset(reuse_packfile, written)
+		to_write = pack_pos_to_offset(reuse_packfile->p, written)
 			- sizeof(struct pack_header);
 
 		/* We're recording one chunk, not one object. */
 		record_reused_object(sizeof(struct pack_header), 0);
 		hashflush(out);
-		copy_pack_data(out, reuse_packfile, w_curs,
+		copy_pack_data(out, reuse_packfile->p, w_curs,
 			sizeof(struct pack_header), to_write);
 
 		display_progress(progress_state, written);
@@ -1123,7 +1124,7 @@ static size_t write_reused_pack_verbatim(struct packed_git *reuse_packfile,
 	return pos;
 }
 
-static void write_reused_pack(struct packed_git *reuse_packfile,
+static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
 			      struct hashfile *f)
 {
 	size_t i = 0;
@@ -1149,8 +1150,8 @@ static void write_reused_pack(struct packed_git *reuse_packfile,
 			 * bitmaps. See comment in try_partial_reuse()
 			 * for why.
 			 */
-			write_reused_pack_one(reuse_packfile, pos + offset, f,
-					      pack_start, &w_curs);
+			write_reused_pack_one(reuse_packfile->p, pos + offset,
+					      f, pack_start, &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
@@ -1206,9 +1207,12 @@ static void write_pack_file(void)
 
 		offset = write_pack_header(f, nr_remaining);
 
-		if (reuse_packfile) {
+		if (reuse_packfiles_nr) {
 			assert(pack_to_stdout);
-			write_reused_pack(reuse_packfile, f);
+			for (j = 0; j < reuse_packfiles_nr; j++) {
+				reused_chunks_nr = 0;
+				write_reused_pack(&reuse_packfiles[j], f);
+			}
 			offset = hashfile_total(f);
 		}
 
@@ -3949,19 +3953,16 @@ static int pack_options_allow_reuse(void)
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
 {
-	struct bitmapped_pack *packs = NULL;
-	size_t packs_nr = 0;
-
 	if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
 		return -1;
 
 	if (pack_options_allow_reuse())
-		reuse_partial_packfile_from_bitmap(bitmap_git, &packs,
-						   &packs_nr,
+		reuse_partial_packfile_from_bitmap(bitmap_git,
+						   &reuse_packfiles,
+						   &reuse_packfiles_nr,
 						   &reuse_packfile_bitmap);
 
-	if (packs) {
-		reuse_packfile = packs[0].p;
+	if (reuse_packfiles) {
 		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
 		if (!reuse_packfile_objects)
 			BUG("expected non-empty reuse bitmap");
-- 
2.43.0.102.ga31d690331.dirty


* [PATCH v2 14/26] pack-objects: prepare `write_reused_pack()` for multi-pack reuse
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (12 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 13/26] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 15/26] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
                     ` (12 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The function `write_reused_pack()` within `builtin/pack-objects.c` is
responsible for performing pack-reuse on a single pack, and performs two
main tasks:

  - it dispatches a call to `write_reused_pack_verbatim()` to see if we
    can reuse portions of the packfile in whole-word chunks

  - for any remaining objects (that is, any objects that appear after
    the first "gap" in the bitmap), it calls `write_reused_pack_one()`
    on each object to record it for reuse.

Prepare this function for multi-pack reuse by removing the assumption
that the bit position corresponding to the first object being reused
from a given pack must be at bit position zero.

The changes in this function are mostly straightforward. Initialize `i`
to the position of the first word to contain bits corresponding to that
reuse pack. In most situations, we throw the initialized value away,
since we end up replacing it with the return value from
write_reused_pack_verbatim(), moving us past the section of whole words
that we reused.

Likewise, modify the per-object loop to ignore any bits at the beginning
of the first word that do not belong to the pack currently being reused,
as well as skip to the "done" section once we have processed the last
bit corresponding to this pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
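Not part of the patch: a self-contained sketch of the range clamping
that the per-object loop performs. `collect_pack_positions()` is a
hypothetical helper that uses plain `uint64_t` words in place of
`eword_t`. It skips set bits before the pack's first bit, stops at the
first set bit past its last, and reports pack-relative positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BITS 64

/*
 * Walk the set bits belonging to the pack that occupies bit positions
 * [bitmap_pos, bitmap_pos + bitmap_nr) and store each one, relative to
 * bitmap_pos, in "out". Returns the number of positions stored (at
 * most "out_cap").
 */
size_t collect_pack_positions(const uint64_t *words, size_t word_nr,
			      size_t bitmap_pos, size_t bitmap_nr,
			      size_t *out, size_t out_cap)
{
	size_t end = bitmap_pos + bitmap_nr;
	size_t n = 0;
	size_t i, offset;

	assert(words && out);

	/* Start at the first word holding bits for this pack. */
	for (i = bitmap_pos / WORD_BITS; i < word_nr; i++) {
		uint64_t word = words[i];
		size_t base = i * WORD_BITS;

		for (offset = 0; offset < WORD_BITS; offset++) {
			if (!(word >> offset))
				break;		/* no more set bits in this word */
			if (!((word >> offset) & 1))
				continue;
			if (base + offset < bitmap_pos)
				continue;	/* belongs to an earlier pack */
			if (base + offset >= end)
				return n;	/* belongs to a later pack */
			if (n == out_cap)
				return n;
			out[n++] = base + offset - bitmap_pos;
		}
	}
	return n;
}
```
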
 builtin/pack-objects.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 07c849b5d4..6ce52d88a9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1127,7 +1127,7 @@ static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
 static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
 			      struct hashfile *f)
 {
-	size_t i = 0;
+	size_t i = reuse_packfile->bitmap_pos / BITS_IN_EWORD;
 	uint32_t offset;
 	off_t pack_start = hashfile_total(f) - sizeof(struct pack_header);
 	struct pack_window *w_curs = NULL;
@@ -1145,17 +1145,23 @@ static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
+			if (pos + offset < reuse_packfile->bitmap_pos)
+				continue;
+			if (pos + offset >= reuse_packfile->bitmap_pos + reuse_packfile->bitmap_nr)
+				goto done;
 			/*
 			 * Can use bit positions directly, even for MIDX
 			 * bitmaps. See comment in try_partial_reuse()
 			 * for why.
 			 */
-			write_reused_pack_one(reuse_packfile->p, pos + offset,
+			write_reused_pack_one(reuse_packfile->p,
+					      pos + offset - reuse_packfile->bitmap_pos,
 					      f, pack_start, &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
 
+done:
 	unuse_pack(&w_curs);
 }
 
-- 
2.43.0.102.ga31d690331.dirty


* [PATCH v2 15/26] pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (13 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 14/26] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 16/26] pack-objects: include number of packs reused in output Taylor Blau
                     ` (11 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The function `write_reused_pack_verbatim()` within
`builtin/pack-objects.c` is responsible for writing out a continuous
set of objects beginning at the start of the reuse packfile.

In the existing implementation, we did something like:

    while (pos < reuse_packfile_bitmap->word_alloc &&
           reuse_packfile_bitmap->words[pos] == (eword_t)~0)
      pos++;

    if (pos)
      /* write first `pos * BITS_IN_EWORD` objects from pack */

as an optimization to record a single chunk for the longest continuous
prefix of objects wanted out of the reuse pack, instead of having a
chunk for each individual object. For more details, see bb514de356
(pack-objects: improve partial packfile reuse, 2019-12-18).

In order to retain this optimization in a multi-pack reuse world, we can
no longer assume that the first object in a pack is on a word boundary
in the bitmap storing the set of reusable objects.

Assuming that all objects from the beginning of the reuse packfile up to
the object corresponding to the first bit on a word boundary are part of
the result, consume whole words at a time until the last whole word
belonging to the reuse packfile. Copy those objects to the resulting
packfile, and track that we reused them by recording a single chunk.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
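Not part of the patch: a simplified sketch of the prefix search
described above, again with `uint64_t` standing in for `eword_t`.
`verbatim_prefix_end()` is a hypothetical name. It only computes where
the verbatim chunk would end and does no copying or chunk recording.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BITS 64

/*
 * Return the absolute bit position up to which the pack occupying bits
 * [bitmap_pos, bitmap_pos + bitmap_nr) can be copied as one chunk, or
 * bitmap_pos itself if nothing can be copied this way. "words" must
 * cover every bit in that range.
 */
size_t verbatim_prefix_end(const uint64_t *words, size_t bitmap_pos,
			   size_t bitmap_nr)
{
	size_t pos = bitmap_pos;
	size_t end = bitmap_pos + bitmap_nr;

	assert(words);

	if (pos % WORD_BITS) {
		/* Every bit from the pack's start to the boundary must be set. */
		size_t boundary = pos + (WORD_BITS - pos % WORD_BITS);
		size_t last = boundary < end ? boundary : end;

		for (; pos < last; pos++)
			if (!((words[pos / WORD_BITS] >> (pos % WORD_BITS)) & 1))
				return bitmap_pos;
		pos = boundary;
	}

	/* Round "end" down to the last word boundary inside the pack. */
	end -= end % WORD_BITS;
	if (pos >= end)
		return bitmap_pos;

	/* Consume whole words for as long as every bit in them is set. */
	while (pos < end && words[pos / WORD_BITS] == ~(uint64_t)0)
		pos += WORD_BITS;

	return pos;
}
```

The result always lies on a word boundary unless it equals
`bitmap_pos`, which is what lets the per-object loop resume at a whole
word.
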
 builtin/pack-objects.c | 73 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 60 insertions(+), 13 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6ce52d88a9..31053128fc 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1097,31 +1097,78 @@ static void write_reused_pack_one(struct packed_git *reuse_packfile,
 
 static size_t write_reused_pack_verbatim(struct bitmapped_pack *reuse_packfile,
 					 struct hashfile *out,
-					 off_t pack_start UNUSED,
+					 off_t pack_start,
 					 struct pack_window **w_curs)
 {
-	size_t pos = 0;
+	size_t pos = reuse_packfile->bitmap_pos;
+	size_t end;
 
-	while (pos < reuse_packfile_bitmap->word_alloc &&
-			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
-		pos++;
+	if (pos % BITS_IN_EWORD) {
+		size_t word_pos = (pos / BITS_IN_EWORD);
+		size_t offset = pos % BITS_IN_EWORD;
+		size_t last;
+		eword_t word = reuse_packfile_bitmap->words[word_pos];
 
-	if (pos) {
-		off_t to_write;
+		if (offset + reuse_packfile->bitmap_nr < BITS_IN_EWORD)
+			last = offset + reuse_packfile->bitmap_nr;
+		else
+			last = BITS_IN_EWORD;
 
-		written = (pos * BITS_IN_EWORD);
-		to_write = pack_pos_to_offset(reuse_packfile->p, written)
-			- sizeof(struct pack_header);
+		for (; offset < last; offset++) {
+			if (word >> offset == 0)
+				return word_pos;
+			if (!bitmap_get(reuse_packfile_bitmap,
+					word_pos * BITS_IN_EWORD + offset))
+				return word_pos;
+		}
+
+		pos += BITS_IN_EWORD - (pos % BITS_IN_EWORD);
+	}
+
+	/*
+	 * Now we're going to copy as many whole eword_t's as possible.
+	 * "end" is the index of the last whole eword_t we copy, but
+	 * there may be additional bits to process. Those are handled
+	 * individually by write_reused_pack().
+	 *
+	 * Begin by advancing to the first word boundary in range of the
+	 * bit positions occupied by objects in "reuse_packfile". Then
+	 * pick the last word boundary in the same range. If we have at
+	 * least one word's worth of bits to process, continue on.
+	 */
+	end = reuse_packfile->bitmap_pos + reuse_packfile->bitmap_nr;
+	if (end % BITS_IN_EWORD)
+		end -= end % BITS_IN_EWORD;
+	if (pos >= end)
+		return reuse_packfile->bitmap_pos / BITS_IN_EWORD;
+
+	while (pos < end &&
+	       reuse_packfile_bitmap->words[pos / BITS_IN_EWORD] == (eword_t)~0)
+		pos += BITS_IN_EWORD;
+
+	if (pos > end)
+		pos = end;
+
+	if (reuse_packfile->bitmap_pos < pos) {
+		off_t pack_start_off = pack_pos_to_offset(reuse_packfile->p, 0);
+		off_t pack_end_off = pack_pos_to_offset(reuse_packfile->p,
+							pos - reuse_packfile->bitmap_pos);
+
+		written += pos - reuse_packfile->bitmap_pos;
 
 		/* We're recording one chunk, not one object. */
-		record_reused_object(sizeof(struct pack_header), 0);
+		record_reused_object(pack_start_off,
+				     pack_start_off - (hashfile_total(out) - pack_start));
 		hashflush(out);
 		copy_pack_data(out, reuse_packfile->p, w_curs,
-			sizeof(struct pack_header), to_write);
+			pack_start_off, pack_end_off - pack_start_off);
 
 		display_progress(progress_state, written);
 	}
-	return pos;
+	if (pos % BITS_IN_EWORD)
+		BUG("attempted to jump past a word boundary to %"PRIuMAX,
+		    (uintmax_t)pos);
+	return pos / BITS_IN_EWORD;
 }
 
 static void write_reused_pack(struct bitmapped_pack *reuse_packfile,
-- 
2.43.0.102.ga31d690331.dirty


* [PATCH v2 16/26] pack-objects: include number of packs reused in output
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (14 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 15/26] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 17/26] git-compat-util.h: implement checked size_t to uint32_t conversion Taylor Blau
                     ` (10 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

In addition to including the number of objects reused verbatim from a
reuse-pack, include the number of packs from which objects were reused.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
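Not part of the patch: a small sketch of how the new trailing field is
formatted, assuming a hypothetical `format_pack_reused()` helper that
writes only the tail of the summary line. The `size_t` count is cast to
`uintmax_t` so that it can be printed portably with `PRIuMAX`.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format the "pack-reused" tail of the summary line into "buf".
 * Returns what snprintf() returns: the length the full string needs,
 * excluding the terminating NUL.
 */
int format_pack_reused(char *buf, size_t len, uint32_t objects,
		       size_t packs_used)
{
	assert(buf && len);
	return snprintf(buf, len,
			"pack-reused %" PRIu32 " (from %" PRIuMAX ")",
			objects, (uintmax_t)packs_used);
}
```

With 12 reused objects drawn from 2 packs, this produces
"pack-reused 12 (from 2)".
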
 builtin/pack-objects.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 31053128fc..7eb035eb7d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -223,6 +223,7 @@ static struct progress *progress_state;
 
 static struct bitmapped_pack *reuse_packfiles;
 static size_t reuse_packfiles_nr;
+static size_t reuse_packfiles_used_nr;
 static uint32_t reuse_packfile_objects;
 static struct bitmap *reuse_packfile_bitmap;
 
@@ -1265,6 +1266,8 @@ static void write_pack_file(void)
 			for (j = 0; j < reuse_packfiles_nr; j++) {
 				reused_chunks_nr = 0;
 				write_reused_pack(&reuse_packfiles[j], f);
+				if (reused_chunks_nr)
+					reuse_packfiles_used_nr++;
 			}
 			offset = hashfile_total(f);
 		}
@@ -4587,9 +4590,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		fprintf_ln(stderr,
 			   _("Total %"PRIu32" (delta %"PRIu32"),"
 			     " reused %"PRIu32" (delta %"PRIu32"),"
-			     " pack-reused %"PRIu32),
+			     " pack-reused %"PRIu32" (from %"PRIuMAX")"),
 			   written, written_delta, reused, reused_delta,
-			   reuse_packfile_objects);
+			   reuse_packfile_objects,
+			   (uintmax_t)reuse_packfiles_used_nr);
 
 cleanup:
 	clear_packing_data(&to_pack);
-- 
2.43.0.102.ga31d690331.dirty


* [PATCH v2 17/26] git-compat-util.h: implement checked size_t to uint32_t conversion
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (15 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 16/26] pack-objects: include number of packs reused in output Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 18/26] midx: implement `midx_preferred_pack()` Taylor Blau
                     ` (9 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

In a similar fashion to other checked cast functions in this header
(such as `cast_size_t_to_ulong()` and `cast_size_t_to_int()`), implement
a checked cast function for going from a size_t to a uint32_t value.

This function will be utilized in a future commit which needs to make
such a conversion.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
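Not part of the patch: a sketch of the same round-trip check in a
variant that reports failure to the caller instead of dying.
`size_t_to_uint32_checked()` is a hypothetical name and is not
something this series adds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store "a" in "*out" if it fits in 32 bits and return 0. Otherwise
 * leave "*out" untouched and return -1.
 */
int size_t_to_uint32_checked(size_t a, uint32_t *out)
{
	assert(out);
	/* The value fits only if it survives truncation unchanged. */
	if (a != (uint32_t)a)
		return -1;
	*out = (uint32_t)a;
	return 0;
}
```

On platforms where size_t is itself 32 bits wide, the comparison can
never fail and the function always succeeds.
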
 git-compat-util.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index 3e7a59b5ff..c3b6c2c226 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1013,6 +1013,15 @@ static inline unsigned long cast_size_t_to_ulong(size_t a)
 	return (unsigned long)a;
 }
 
+static inline uint32_t cast_size_t_to_uint32_t(size_t a)
+{
+	if (a != (uint32_t)a)
+		die("object too large to read on this platform: %"
+		    PRIuMAX" is cut off to %u",
+		    (uintmax_t)a, (uint32_t)a);
+	return (uint32_t)a;
+}
+
 static inline int cast_size_t_to_int(size_t a)
 {
 	if (a > INT_MAX)
-- 
2.43.0.102.ga31d690331.dirty


* [PATCH v2 18/26] midx: implement `midx_preferred_pack()`
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (16 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 17/26] git-compat-util.h: implement checked size_t to uint32_t conversion Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 19/26] pack-revindex: factor out `midx_key_to_pack_pos()` helper Taylor Blau
                     ` (8 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

When performing a binary search over the objects in a MIDX's bitmap
(i.e. in pseudo-pack order), the reader reconstructs the pseudo-pack
ordering using a combination of (a) the preferred pack, (b) the pack's
lexical position in the MIDX based on pack names, and (c) the object
offset within the pack.

In order to perform this binary search, the reader must know the
identity of the preferred pack. This could be stored in the MIDX, but
isn't for historical reasons, mostly because it can easily be inferred
at read-time by looking at the object in the first bit position and
finding out which pack it was selected from in the MIDX, like so:

    nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0));

In midx_to_pack_pos(), which performs this binary search, we look up the
identity of the preferred pack before each search. This is relatively
quick, since it involves two table-driven lookups (one in the MIDX's
revindex for `pack_pos_to_midx()`, and another in the MIDX's object
table for `nth_midxed_pack_int_id()`).

But since the preferred pack does not change after the MIDX is written,
it is safe to cache this value on the MIDX itself.

Write a helper to do just that, and rewrite all of the existing
call-sites that care about the identity of the preferred pack in terms
of this new helper.

This will prepare us for a subsequent patch where we will need to binary
search through the MIDX's pseudo-pack order multiple times.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c                    | 20 ++++++++++++++++++++
 midx.h                    |  2 ++
 pack-bitmap.c             | 17 +++++++----------
 pack-bitmap.h             |  1 -
 pack-revindex.c           |  4 +++-
 t/helper/test-read-midx.c | 13 +++++--------
 6 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/midx.c b/midx.c
index beaf0c0de4..85e1c2cd12 100644
--- a/midx.c
+++ b/midx.c
@@ -21,6 +21,7 @@
 #include "refs.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "pack-revindex.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -177,6 +178,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
 
+	m->preferred_pack_idx = -1;
+
 	cf = init_chunkfile(NULL);
 
 	if (read_table_of_contents(cf, m->data, midx_size,
@@ -460,6 +463,23 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 	return midx_locate_pack(m, idx_or_pack_name, NULL);
 }
 
+int midx_preferred_pack(struct multi_pack_index *m, uint32_t *pack_int_id)
+{
+	if (m->preferred_pack_idx == -1) {
+		if (load_midx_revindex(m) < 0) {
+			m->preferred_pack_idx = -2;
+			return -1;
+		}
+
+		m->preferred_pack_idx =
+			nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0));
+	} else if (m->preferred_pack_idx == -2)
+		return -1; /* no revindex */
+
+	*pack_int_id = m->preferred_pack_idx;
+	return 0;
+}
+
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local)
 {
 	struct multi_pack_index *m;
diff --git a/midx.h b/midx.h
index 89c5aa637e..f87a8fff26 100644
--- a/midx.h
+++ b/midx.h
@@ -29,6 +29,7 @@ struct multi_pack_index {
 	unsigned char num_chunks;
 	uint32_t num_packs;
 	uint32_t num_objects;
+	int preferred_pack_idx;
 
 	int local;
 
@@ -74,6 +75,7 @@ int midx_contains_pack(struct multi_pack_index *m,
 		       const char *idx_or_pack_name);
 int midx_locate_pack(struct multi_pack_index *m, const char *idx_or_pack_name,
 		     uint32_t *pos);
+int midx_preferred_pack(struct multi_pack_index *m, uint32_t *pack_int_id);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
 /*
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 4d5a484678..1682f99596 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -338,7 +338,7 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git,
 	struct stat st;
 	char *bitmap_name = midx_bitmap_filename(midx);
 	int fd = git_open(bitmap_name);
-	uint32_t i;
+	uint32_t i, preferred_pack;
 	struct packed_git *preferred;
 
 	if (fd < 0) {
@@ -393,7 +393,12 @@ static int open_midx_bitmap_1(struct bitmap_index *bitmap_git,
 		}
 	}
 
-	preferred = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
+	if (midx_preferred_pack(bitmap_git->midx, &preferred_pack) < 0) {
+		warning(_("could not determine MIDX preferred pack"));
+		goto cleanup;
+	}
+
+	preferred = bitmap_git->midx->packs[preferred_pack];
 	if (!is_pack_valid(preferred)) {
 		warning(_("preferred pack (%s) is invalid"),
 			preferred->pack_name);
@@ -1926,14 +1931,6 @@ static int try_partial_reuse(struct bitmapped_pack *pack,
 	return 0;
 }
 
-uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
-{
-	struct multi_pack_index *m = bitmap_git->midx;
-	if (!m)
-		BUG("midx_preferred_pack: requires non-empty MIDX");
-	return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
-}
-
 static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git,
 						 struct bitmapped_pack *pack,
 						 struct bitmap *reuse)
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 7a12a2ce81..179b343912 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -77,7 +77,6 @@ int test_bitmap_hashes(struct repository *r);
 
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects);
-uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 					struct bitmapped_pack **packs_out,
 					size_t *packs_nr_out,
diff --git a/pack-revindex.c b/pack-revindex.c
index acf1dd9786..7dc6c776d5 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -542,7 +542,9 @@ int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos)
 	 * implicitly is preferred (and includes all its objects, since ties are
 	 * broken first by pack identifier).
 	 */
-	key.preferred_pack = nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0));
+	if (midx_preferred_pack(key.midx, &key.preferred_pack) < 0)
+		return error(_("could not determine preferred pack"));
+
 
 	found = bsearch(&key, m->revindex_data, m->num_objects,
 			sizeof(*m->revindex_data), midx_pack_order_cmp);
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index e48557aba1..4acae41bb9 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -6,6 +6,7 @@
 #include "pack-bitmap.h"
 #include "packfile.h"
 #include "setup.h"
+#include "gettext.h"
 
 static int read_midx_file(const char *object_dir, int show_objects)
 {
@@ -79,7 +80,7 @@ static int read_midx_checksum(const char *object_dir)
 static int read_midx_preferred_pack(const char *object_dir)
 {
 	struct multi_pack_index *midx = NULL;
-	struct bitmap_index *bitmap = NULL;
+	uint32_t preferred_pack;
 
 	setup_git_directory();
 
@@ -87,16 +88,12 @@ static int read_midx_preferred_pack(const char *object_dir)
 	if (!midx)
 		return 1;
 
-	bitmap = prepare_bitmap_git(the_repository);
-	if (!bitmap)
-		return 1;
-	if (!bitmap_is_midx(bitmap)) {
-		free_bitmap_index(bitmap);
+	if (midx_preferred_pack(midx, &preferred_pack) < 0) {
+		warning(_("could not determine MIDX preferred pack"));
 		return 1;
 	}
 
-	printf("%s\n", midx->pack_names[midx_preferred_pack(bitmap)]);
-	free_bitmap_index(bitmap);
+	printf("%s\n", midx->pack_names[preferred_pack]);
 	return 0;
 }
 
-- 
2.43.0.102.ga31d690331.dirty



* [PATCH v2 19/26] pack-revindex: factor out `midx_key_to_pack_pos()` helper
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (17 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 18/26] midx: implement `midx_preferred_pack()` Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 20/26] pack-revindex: implement `midx_pair_to_pack_pos()` Taylor Blau
                     ` (7 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

The `midx_to_pack_pos()` function implements a binary search that
translates an object's position in the MIDX's lexical order into its
position in pseudo-pack order. It takes an index into the lexical order
(i.e. the same argument you'd use for `nth_midxed_object_id()` and
similar) and returns the corresponding position in the pseudo-pack
order.

This works for all callers, since they are all currently translating
from lexical order to pseudo-pack order. But future callers may want to
translate a known (offset, pack_id) tuple into an index into the
pseudo-pack order, without knowing where that (offset, pack_id) tuple
appears in lexical order.

Prepare for implementing a function that translates an (offset,
pack_id) tuple into an index into the pseudo-pack order by extracting a
helper function which does just that, and then reimplementing
midx_to_pack_pos() in terms of it.
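
To make the shape of such a key-based search concrete, here is a toy
sketch. It assumes a pseudo-pack order of "preferred pack first, then
by pack ID, then by offset", and a revindex that stores lexical
positions in that order. All `toy_*` names are hypothetical and are not
part of this patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy object table in "lexical" order. */
struct toy_obj { uint32_t pack; uint64_t offset; };

struct toy_key {
	const struct toy_obj *objs; /* table the revindex points into */
	uint32_t pack;
	uint64_t offset;
	uint32_t preferred_pack;
};

/* Orders by: preferred pack first, then pack ID, then offset. */
static int toy_order_cmp(const void *va, const void *vb)
{
	const struct toy_key *key = va;
	const struct toy_obj *o = &key->objs[*(const uint32_t *)vb];
	int key_pref = key->pack == key->preferred_pack;
	int obj_pref = o->pack == key->preferred_pack;

	if (key_pref != obj_pref)
		return key_pref ? -1 : 1;
	if (key->pack != o->pack)
		return key->pack < o->pack ? -1 : 1;
	if (key->offset != o->offset)
		return key->offset < o->offset ? -1 : 1;
	return 0;
}

/* Returns 0 and sets *pos on a match, -1 if the pair is absent. */
static int toy_key_to_pos(const struct toy_key *key, const uint32_t *revindex,
			  size_t nr, uint32_t *pos)
{
	const uint32_t *found;

	assert(key != NULL && pos != NULL);
	found = bsearch(key, revindex, nr, sizeof(*revindex), toy_order_cmp);
	if (!found)
		return -1;
	*pos = (uint32_t)(found - revindex);
	return 0;
}
```

Because the search is driven entirely by the key, a caller can fill in
the (pack, offset) pair from any source, not just from a lexical
position.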

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 39 ++++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 7dc6c776d5..baa4657ed3 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -520,19 +520,12 @@ static int midx_pack_order_cmp(const void *va, const void *vb)
 	return 0;
 }
 
-int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos)
+static int midx_key_to_pack_pos(struct multi_pack_index *m,
+				struct midx_pack_key *key,
+				uint32_t *pos)
 {
-	struct midx_pack_key key;
 	uint32_t *found;
 
-	if (!m->revindex_data)
-		BUG("midx_to_pack_pos: reverse index not yet loaded");
-	if (m->num_objects <= at)
-		BUG("midx_to_pack_pos: out-of-bounds object at %"PRIu32, at);
-
-	key.pack = nth_midxed_pack_int_id(m, at);
-	key.offset = nth_midxed_offset(m, at);
-	key.midx = m;
 	/*
 	 * The preferred pack sorts first, so determine its identifier by
 	 * looking at the first object in pseudo-pack order.
@@ -542,16 +535,32 @@ int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos)
 	 * implicitly is preferred (and includes all its objects, since ties are
 	 * broken first by pack identifier).
 	 */
-	if (midx_preferred_pack(key.midx, &key.preferred_pack) < 0)
+	if (midx_preferred_pack(key->midx, &key->preferred_pack) < 0)
 		return error(_("could not determine preferred pack"));
 
-
-	found = bsearch(&key, m->revindex_data, m->num_objects,
-			sizeof(*m->revindex_data), midx_pack_order_cmp);
+	found = bsearch(key, m->revindex_data, m->num_objects,
+			sizeof(*m->revindex_data),
+			midx_pack_order_cmp);
 
 	if (!found)
-		return error("bad offset for revindex");
+		return -1;
 
 	*pos = found - m->revindex_data;
 	return 0;
 }
+
+int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos)
+{
+	struct midx_pack_key key;
+
+	if (!m->revindex_data)
+		BUG("midx_to_pack_pos: reverse index not yet loaded");
+	if (m->num_objects <= at)
+		BUG("midx_to_pack_pos: out-of-bounds object at %"PRIu32, at);
+
+	key.pack = nth_midxed_pack_int_id(m, at);
+	key.offset = nth_midxed_offset(m, at);
+	key.midx = m;
+
+	return midx_key_to_pack_pos(m, &key, pos);
+}
-- 
2.43.0.102.ga31d690331.dirty



* [PATCH v2 20/26] pack-revindex: implement `midx_pair_to_pack_pos()`
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (18 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 19/26] pack-revindex: factor out `midx_key_to_pack_pos()` helper Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 21/26] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
                     ` (6 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Now that we have extracted the `midx_key_to_pack_pos()` function, we can
implement the `midx_pair_to_pack_pos()` function which accepts (pack_id,
offset) tuples and returns an index into the pseudo-pack order.

This will be used in a following commit in order to figure out whether
or not the MIDX chose a given delta's base object from the same pack as
the delta resides in. It will do so by locating the base object's offset
in the pack, and then performing a binary search using the same pack ID
with the base object's offset.

If (and only if) it finds a match (at any position) we can guarantee
that the MIDX selected both halves of the delta/base pair from the same
pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 11 +++++++++++
 pack-revindex.h |  3 +++
 2 files changed, 14 insertions(+)

diff --git a/pack-revindex.c b/pack-revindex.c
index baa4657ed3..a7624d8be8 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -564,3 +564,14 @@ int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos)
 
 	return midx_key_to_pack_pos(m, &key, pos);
 }
+
+int midx_pair_to_pack_pos(struct multi_pack_index *m, uint32_t pack_int_id,
+			  off_t ofs, uint32_t *pos)
+{
+	struct midx_pack_key key = {
+		.pack = pack_int_id,
+		.offset = ofs,
+		.midx = m,
+	};
+	return midx_key_to_pack_pos(m, &key, pos);
+}
diff --git a/pack-revindex.h b/pack-revindex.h
index 6dd47efea1..422c2487ae 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -142,4 +142,7 @@ uint32_t pack_pos_to_midx(struct multi_pack_index *m, uint32_t pos);
  */
 int midx_to_pack_pos(struct multi_pack_index *midx, uint32_t at, uint32_t *pos);
 
+int midx_pair_to_pack_pos(struct multi_pack_index *midx, uint32_t pack_id,
+			  off_t ofs, uint32_t *pos);
+
 #endif
-- 
2.43.0.102.ga31d690331.dirty



* [PATCH v2 21/26] pack-bitmap: prepare to mark objects from multiple packs for reuse
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (19 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 20/26] pack-revindex: implement `midx_pair_to_pack_pos()` Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 22/26] pack-objects: add tracing for various packfile metrics Taylor Blau
                     ` (5 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Now that the pack-objects code is equipped to handle reusing objects
from multiple packs, prepare the pack-bitmap code to mark objects from
multiple packs as reuse candidates.

In order to prepare the pack-bitmap code for this change, remove the
same set of assumptions we unwound in previous commits from the helper
function `reuse_partial_packfile_from_bitmap_1()`, in preparation for it
to be called in a loop over the set of bitmapped packs in a following
commit.

Most importantly, we can no longer assume that the bit position
corresponding to the first object in a given reuse pack candidate is at
the beginning of the bitmap itself.

For the single pack that this assumption is still true for (in MIDX
bitmaps, this is the preferred pack, in single-pack bitmaps it is the
pack the bitmap is tied to), we can still use our whole-words
optimization.

But for all subsequent packs, we cannot make use of this optimization,
since it assumes that all delta bases are being sent from the same pack,
which would break if we are sending OFS_DELTAs down to the client. To
understand why, consider two packs, P1 and P2 where:

  - P1 has object A which is a delta on base B
  - P2 has its own copy of B, in addition to other objects

Suppose that the MIDX which covers P1 and P2 selected its copy of A from
P1, but selected its copy of B from P2. Since A is a delta of B, but the
base was selected from a different pack, sending the bytes corresponding
to A as an OFS_DELTA verbatim from P1 would be incorrect, since we don't
guarantee that B is in the same place relative to A in the generated
pack as in P1.

For now, we detect and reject these cross-pack deltas by searching for
the (pack_id, offset) pair for the delta's base object (using the same
pack_id as the pack containing the delta'd object) in the MIDX. If we
find a match, that means that the MIDX did indeed pick the base object
from the same pack, and we are OK to reuse the delta.

If we don't find a match, however, that means that the base object was
selected from a different pack in the MIDX, and we can let the slower
path handle re-delta'ing our candidate object.
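
The check can be illustrated with a toy model of the P1/P2 scenario
above. The sketch assumes the MIDX's selections are available as a flat
list of (pack, offset) pairs, and uses a linear scan where the series
uses a binary search; `toy_pick` and `toy_base_in_same_pack()` are
hypothetical names for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One MIDX-selected copy of an object: which pack, and where in it. */
struct toy_pick { uint32_t pack; uint64_t offset; };

/*
 * Returns 1 if the MIDX selected the object at (pack, base_offset),
 * i.e. the delta's base was chosen from the delta's own pack, and 0
 * otherwise.
 */
static int toy_base_in_same_pack(const struct toy_pick *picks, size_t nr,
				 uint32_t pack, uint64_t base_offset)
{
	size_t i;

	assert(picks != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (picks[i].pack == pack && picks[i].offset == base_offset)
			return 1;
	return 0;
}
```

If A lives in P1 with its base B at some offset in P1, but the MIDX
picked B's copy from P2, then no pair (P1, offset of B) exists and the
delta is left for the slower path.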

In the future, there are a couple of other things we could do, namely:

  - Turn any cross-pack deltas (which are stored as OFS_DELTAs) into
    REF_DELTAs. We already do this today when reusing an OFS_DELTA
    without `--delta-base-offset` enabled, so it's not a huge stretch to
    do the same for cross-pack deltas even when `--delta-base-offset` is
    enabled.

    This would work, but would obviously result in larger-than-necessary
    packs, as we in theory *could* represent these cross-pack deltas by
    patching an existing OFS_DELTA. But it's not clear how much that
    would matter in practice. I suspect it would have a lot to do with
    how you pack your repository in the first place.

  - Finally, we could patch OFS_DELTAs across packs in a similar fashion
    as we do today for OFS_DELTAs within a single pack on either side of
    a gap. This would result in the smallest packs of the three options
    here, but implementing this would be more involved.

    At minimum, you'd have to keep the reusable chunks list for all
    reused packs, not just the one we're currently processing. And you'd
    have to ensure that any bases which are a part of cross-pack deltas
    appear before the delta. I think this is possible to do, but would
    require assembling the reusable chunks list potentially in a
    different order than they appear in the source packs.

For now, let's pursue the simplest approach and reject any cross-pack
deltas.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 172 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 106 insertions(+), 66 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 1682f99596..242a5908f7 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1841,8 +1841,10 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
  * -1 means "stop trying further objects"; 0 means we may or may not have
  * reused, but you can keep feeding bits.
  */
-static int try_partial_reuse(struct bitmapped_pack *pack,
-			     size_t pos,
+static int try_partial_reuse(struct bitmap_index *bitmap_git,
+			     struct bitmapped_pack *pack,
+			     size_t bitmap_pos,
+			     uint32_t pack_pos,
 			     struct bitmap *reuse,
 			     struct pack_window **w_curs)
 {
@@ -1850,33 +1852,10 @@ static int try_partial_reuse(struct bitmapped_pack *pack,
 	enum object_type type;
 	unsigned long size;
 
-	/*
-	 * try_partial_reuse() is called either on (a) objects in the
-	 * bitmapped pack (in the case of a single-pack bitmap) or (b)
-	 * objects in the preferred pack of a multi-pack bitmap.
-	 * Importantly, the latter can pretend as if only a single pack
-	 * exists because:
-	 *
-	 *   - The first pack->num_objects bits of a MIDX bitmap are
-	 *     reserved for the preferred pack, and
-	 *
-	 *   - Ties due to duplicate objects are always resolved in
-	 *     favor of the preferred pack.
-	 *
-	 * Therefore we do not need to ever ask the MIDX for its copy of
-	 * an object by OID, since it will always select it from the
-	 * preferred pack. Likewise, the selected copy of the base
-	 * object for any deltas will reside in the same pack.
-	 *
-	 * This means that we can reuse pos when looking up the bit in
-	 * the reuse bitmap, too, since bits corresponding to the
-	 * preferred pack precede all bits from other packs.
-	 */
+	if (pack_pos >= pack->p->num_objects)
+		return -1; /* not actually in the pack */
 
-	if (pos >= pack->p->num_objects)
-		return -1; /* not actually in the pack or MIDX preferred pack */
-
-	offset = delta_obj_offset = pack_pos_to_offset(pack->p, pos);
+	offset = delta_obj_offset = pack_pos_to_offset(pack->p, pack_pos);
 	type = unpack_object_header(pack->p, w_curs, &offset, &size);
 	if (type < 0)
 		return -1; /* broken packfile, punt */
@@ -1884,6 +1863,7 @@ static int try_partial_reuse(struct bitmapped_pack *pack,
 	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
 		off_t base_offset;
 		uint32_t base_pos;
+		uint32_t base_bitmap_pos;
 
 		/*
 		 * Find the position of the base object so we can look it up
@@ -1897,20 +1877,44 @@ static int try_partial_reuse(struct bitmapped_pack *pack,
 					     delta_obj_offset);
 		if (!base_offset)
 			return 0;
-		if (offset_to_pack_pos(pack->p, base_offset, &base_pos) < 0)
-			return 0;
 
-		/*
-		 * We assume delta dependencies always point backwards. This
-		 * lets us do a single pass, and is basically always true
-		 * due to the way OFS_DELTAs work. You would not typically
-		 * find REF_DELTA in a bitmapped pack, since we only bitmap
-		 * packs we write fresh, and OFS_DELTA is the default). But
-		 * let's double check to make sure the pack wasn't written with
-		 * odd parameters.
-		 */
-		if (base_pos >= pos)
-			return 0;
+		offset_to_pack_pos(pack->p, base_offset, &base_pos);
+
+		if (bitmap_is_midx(bitmap_git)) {
+			/*
+			 * Cross-pack deltas are rejected for now, but could
+			 * theoretically be supported in the future.
+			 *
+			 * We would need to ensure that we're sending both
+			 * halves of the delta/base pair, regardless of whether
+			 * or not the two cross a pack boundary. If they do,
+			 * then we must convert the delta to a REF_DELTA to
+			 * refer back to the base in the other pack.
+			 * */
+			if (midx_pair_to_pack_pos(bitmap_git->midx,
+						  pack->pack_int_id,
+						  base_offset,
+						  &base_bitmap_pos) < 0) {
+				return 0;
+			}
+		} else {
+			if (offset_to_pack_pos(pack->p, base_offset,
+					       &base_pos) < 0)
+				return 0;
+			/*
+			 * We assume delta dependencies always point backwards.
+			 * This lets us do a single pass, and is basically
+			 * always true due to the way OFS_DELTAs work. You would
+			 * not typically find REF_DELTA in a bitmapped pack,
+			 * since we only bitmap packs we write fresh, and
+			 * OFS_DELTA is the default). But let's double check to
+			 * make sure the pack wasn't written with odd
+			 * parameters.
+			 */
+			if (base_pos >= pack_pos)
+				return 0;
+			base_bitmap_pos = pack->bitmap_pos + base_pos;
+		}
 
 		/*
 		 * And finally, if we're not sending the base as part of our
@@ -1920,14 +1924,14 @@ static int try_partial_reuse(struct bitmapped_pack *pack,
 		 * to REF_DELTA on the fly. Better to just let the normal
 		 * object_entry code path handle it.
 		 */
-		if (!bitmap_get(reuse, pack->bitmap_pos + base_pos))
+		if (!bitmap_get(reuse, base_bitmap_pos))
 			return 0;
 	}
 
 	/*
 	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
-	bitmap_set(reuse, pack->bitmap_pos + pos);
+	bitmap_set(reuse, bitmap_pos);
 	return 0;
 }
 
@@ -1937,36 +1941,72 @@ static void reuse_partial_packfile_from_bitmap_1(struct bitmap_index *bitmap_git
 {
 	struct bitmap *result = bitmap_git->result;
 	struct pack_window *w_curs = NULL;
-	size_t i = 0;
+	size_t pos = pack->bitmap_pos / BITS_IN_EWORD;
 
-	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
-		i++;
+	if (!pack->bitmap_pos) {
+		/*
+		 * If we're processing the first (in the case of a MIDX, the
+		 * preferred pack) or the only (in the case of single-pack
+		 * bitmaps) pack, then we can reuse whole words at a time.
+		 *
+		 * This is because we know that any deltas in this range *must*
+		 * have their bases chosen from the same pack, since:
+		 *
+		 * - In the single pack case, there is no other pack to choose
+		 *   them from.
+		 *
+		 * - In the MIDX case, the first pack is the preferred pack, so
+		 *   all ties are broken in favor of that pack (i.e. the one
+		 *   we're currently processing). So any duplicate bases will be
+		 *   resolved in favor of the pack we're processing.
+		 */
+		while (pos < result->word_alloc &&
+		       pos < pack->bitmap_nr / BITS_IN_EWORD &&
+		       result->words[pos] == (eword_t)~0)
+			pos++;
+		memset(reuse->words, 0xFF, pos * sizeof(eword_t));
+	}
 
-	/*
-	 * Don't mark objects not in the packfile or preferred pack. This bitmap
-	 * marks objects eligible for reuse, but the pack-reuse code only
-	 * understands how to reuse a single pack. Since the preferred pack is
-	 * guaranteed to have all bases for its deltas (in a multi-pack bitmap),
-	 * we use it instead of another pack. In single-pack bitmaps, the choice
-	 * is made for us.
-	 */
-	if (i > pack->p->num_objects / BITS_IN_EWORD)
-		i = pack->p->num_objects / BITS_IN_EWORD;
-
-	memset(reuse->words, 0xFF, i * sizeof(eword_t));
-
-	for (; i < result->word_alloc; ++i) {
-		eword_t word = result->words[i];
-		size_t pos = (i * BITS_IN_EWORD);
+	for (; pos < result->word_alloc; pos++) {
+		eword_t word = result->words[pos];
 		size_t offset;
 
-		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
-			if ((word >> offset) == 0)
+		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
+			size_t bit_pos;
+			uint32_t pack_pos;
+
+			if (word >> offset == 0)
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
-			if (try_partial_reuse(pack, pos + offset,
-					      reuse, &w_curs) < 0) {
+
+			bit_pos = pos * BITS_IN_EWORD + offset;
+			if (bit_pos < pack->bitmap_pos)
+				continue;
+			if (bit_pos >= pack->bitmap_pos + pack->bitmap_nr)
+				goto done;
+
+			if (bitmap_is_midx(bitmap_git)) {
+				uint32_t midx_pos;
+				off_t ofs;
+
+				midx_pos = pack_pos_to_midx(bitmap_git->midx, bit_pos);
+				ofs = nth_midxed_offset(bitmap_git->midx, midx_pos);
+
+				if (offset_to_pack_pos(pack->p, ofs, &pack_pos) < 0)
+					BUG("could not find object in pack %s "
+					    "at offset %"PRIuMAX" in MIDX",
+					    pack_basename(pack->p), (uintmax_t)ofs);
+			} else {
+				pack_pos = cast_size_t_to_uint32_t(st_sub(bit_pos, pack->bitmap_pos));
+				if (pack_pos >= pack->p->num_objects)
+					BUG("advanced beyond the end of pack %s (%"PRIuMAX" > %"PRIu32")",
+					    pack_basename(pack->p), (uintmax_t)pack_pos,
+					    pack->p->num_objects);
+			}
+
+			if (try_partial_reuse(bitmap_git, pack, bit_pos,
+					      pack_pos, reuse, &w_curs) < 0) {
 				/*
 				 * try_partial_reuse indicated we couldn't reuse
 				 * any bits, so there is no point in trying more
-- 
2.43.0.102.ga31d690331.dirty



* [PATCH v2 22/26] pack-objects: add tracing for various packfile metrics
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (20 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 21/26] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 23/26] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
                     ` (4 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

As part of the multi-pack reuse effort, we will want to add some tests
that assert that we reused a certain number of objects from a certain
number of packs.

We could do this by grepping through the stderr output of
`pack-objects`, but doing so would be brittle in case the output format
changed.

Instead, let's use the trace2 mechanism to log various pieces of
information about the generated packfile, which we can then use to
compare against desired values.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7eb035eb7d..7aae9f104b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4595,6 +4595,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			   reuse_packfile_objects,
 			   (uintmax_t)reuse_packfiles_used_nr);
 
+	trace2_data_intmax("pack-objects", the_repository, "written", written);
+	trace2_data_intmax("pack-objects", the_repository, "written/delta", written_delta);
+	trace2_data_intmax("pack-objects", the_repository, "reused", reused);
+	trace2_data_intmax("pack-objects", the_repository, "reused/delta", reused_delta);
+	trace2_data_intmax("pack-objects", the_repository, "pack-reused", reuse_packfile_objects);
+	trace2_data_intmax("pack-objects", the_repository, "packs-reused", reuse_packfiles_used_nr);
+
 cleanup:
 	clear_packing_data(&to_pack);
 	list_objects_filter_release(&filter_options);
-- 
2.43.0.102.ga31d690331.dirty



* [PATCH v2 23/26] t/test-lib-functions.sh: implement `test_trace2_data` helper
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (21 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 22/26] pack-objects: add tracing for various packfile metrics Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 24/26] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
                     ` (3 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Introduce a helper function which looks for a specific (category, key,
value) tuple in the output of a trace2 event stream.

We will use this function in a future patch to ensure that the expected
number of objects are reused from an expected number of packs.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/test-lib-functions.sh | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 9c3cf12b26..93fe819b0a 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1874,6 +1874,20 @@ test_region () {
 	return 0
 }
 
+# Check that the given data fragment was included as part of the
+# trace2-format trace on stdin.
+#
+#	test_trace2_data <category> <key> <value>
+#
+# For example, to look for trace2_data_intmax("pack-objects", repo,
+# "reused", N) in an invocation of "git pack-objects", run:
+#
+#	GIT_TRACE2_EVENT="$(pwd)/trace.txt" git pack-objects ... &&
+#	test_trace2_data pack-objects reused N <trace.txt
+test_trace2_data () {
+	grep -e '"category":"'"$1"'","key":"'"$2"'","value":"'"$3"'"'
+}
+
 # Given a GIT_TRACE2_EVENT log over stdin, writes to stdout a list of URLs
 # sent to git-remote-https child processes.
 test_remote_https_urls() {
-- 
2.43.0.102.ga31d690331.dirty



* [PATCH v2 24/26] pack-objects: allow setting `pack.allowPackReuse` to "single"
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (22 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 23/26] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 25/26] pack-bitmap: enable reuse from all bitmapped packs Taylor Blau
                     ` (2 subsequent siblings)
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

In e704fc7978 (pack-objects: introduce pack.allowPackReuse, 2019-12-18),
the `pack.allowPackReuse` configuration option was introduced, allowing
users to disable the pack reuse mechanism.

To prepare for debugging multi-pack reuse, allow setting this option to
"single" in addition to the usual boolean values.

"single" implies the same behavior as "true", "1", "yes", and so on. But
it will complement a new "multi" value (to be introduced in a future
commit). When set to "single", we will only perform pack reuse on a
single pack, regardless of whether or not there are multiple MIDX'd
packs.

This requires no code changes (yet), since we only support single pack
reuse.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt |  2 +-
 builtin/pack-objects.c        | 19 ++++++++++++++++---
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index f50df9dbce..fe100d0fb7 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -28,7 +28,7 @@ all existing objects. You can force recompression by passing the -F option
 to linkgit:git-repack[1].
 
 pack.allowPackReuse::
-	When true, and when reachability bitmaps are enabled,
+	When true or "single", and when reachability bitmaps are enabled,
 	pack-objects will try to send parts of the bitmapped packfile
 	verbatim. This can reduce memory and CPU usage to serve fetches,
 	but might result in sending a slightly larger pack. Defaults to
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7aae9f104b..684698f679 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -229,7 +229,10 @@ static struct bitmap *reuse_packfile_bitmap;
 
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
-static int allow_pack_reuse = 1;
+static enum {
+	NO_PACK_REUSE = 0,
+	SINGLE_PACK_REUSE,
+} allow_pack_reuse = SINGLE_PACK_REUSE;
 static enum {
 	WRITE_BITMAP_FALSE = 0,
 	WRITE_BITMAP_QUIET,
@@ -3244,7 +3247,17 @@ static int git_pack_config(const char *k, const char *v,
 		return 0;
 	}
 	if (!strcmp(k, "pack.allowpackreuse")) {
-		allow_pack_reuse = git_config_bool(k, v);
+		int res = git_parse_maybe_bool_text(v);
+		if (res < 0) {
+			if (!strcasecmp(v, "single"))
+				allow_pack_reuse = SINGLE_PACK_REUSE;
+			else
+				die(_("invalid pack.allowPackReuse value: '%s'"), v);
+		} else if (res) {
+			allow_pack_reuse = SINGLE_PACK_REUSE;
+		} else {
+			allow_pack_reuse = NO_PACK_REUSE;
+		}
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -3999,7 +4012,7 @@ static void loosen_unused_packed_objects(void)
  */
 static int pack_options_allow_reuse(void)
 {
-	return allow_pack_reuse &&
+	return allow_pack_reuse != NO_PACK_REUSE &&
 	       pack_to_stdout &&
 	       !ignore_packed_keep_on_disk &&
 	       !ignore_packed_keep_in_core &&
-- 
2.43.0.102.ga31d690331.dirty
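
The value-to-mode mapping in the hunk above can be modeled roughly as follows. This is a simplified sketch, not the C implementation: `parse_allow_pack_reuse` is a hypothetical name, and it covers only the textual boolean words plus "single". Any other spelling that Git's own parser might accept is deliberately not modeled.

```shell
# Hypothetical model of the parsing in git_pack_config(); prints the
# resulting mode name, or fails for a value this sketch does not know.
parse_allow_pack_reuse () {
	case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
	true|yes|on)
		echo SINGLE_PACK_REUSE ;;
	false|no|off)
		echo NO_PACK_REUSE ;;
	single)
		echo SINGLE_PACK_REUSE ;;
	*)
		echo "invalid pack.allowPackReuse value: '$1'" >&2
		return 1 ;;
	esac
}

parse_allow_pack_reuse true
parse_allow_pack_reuse Single
parse_allow_pack_reuse off
parse_allow_pack_reuse bogus 2>/dev/null || echo "rejected"
```

The lowercase conversion stands in for the case-insensitive comparisons (`strcasecmp()`) used in the patch.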


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH v2 25/26] pack-bitmap: enable reuse from all bitmapped packs
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (23 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 24/26] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-14 22:24   ` [PATCH v2 26/26] t/perf: add performance tests for multi-pack reuse Taylor Blau
  2023-12-15  0:06   ` [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse Junio C Hamano
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

Now that both the pack-bitmap and pack-objects code are prepared to
handle marking and using objects from multiple bitmapped packs for
verbatim reuse, allow marking objects from all bitmapped packs as
eligible for reuse.

Within the `reuse_partial_packfile_from_bitmap()` function, we no longer
only mark the pack whose first object is at bit position zero for reuse,
and instead mark any pack contained in the MIDX as a reuse candidate.

Provide a handful of test cases in a new script (t5332) exercising
interesting behavior for multi-pack reuse to ensure that we performed
all of the previous steps correctly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt |  16 ++-
 builtin/pack-objects.c        |   6 +-
 pack-bitmap.c                 |  34 ++++--
 pack-bitmap.h                 |   3 +-
 t/t5332-multi-pack-reuse.sh   | 203 ++++++++++++++++++++++++++++++++++
 5 files changed, 245 insertions(+), 17 deletions(-)
 create mode 100755 t/t5332-multi-pack-reuse.sh

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index fe100d0fb7..9c630863e6 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -28,11 +28,17 @@ all existing objects. You can force recompression by passing the -F option
 to linkgit:git-repack[1].
 
 pack.allowPackReuse::
-	When true or "single", and when reachability bitmaps are enabled,
-	pack-objects will try to send parts of the bitmapped packfile
-	verbatim. This can reduce memory and CPU usage to serve fetches,
-	but might result in sending a slightly larger pack. Defaults to
-	true.
+	When true or "single", and when reachability bitmaps are
+	enabled, pack-objects will try to send parts of the bitmapped
+	packfile verbatim. When "multi", and when a multi-pack
+	reachability bitmap is available, pack-objects will try to send
+	parts of all packs in the MIDX.
++
+	If only a single pack bitmap is available, and
+	`pack.allowPackReuse` is set to "multi", reuse parts of just the
+	bitmapped packfile. This can reduce memory and CPU usage to
+	serve fetches, but might result in sending a slightly larger
+	pack. Defaults to true.
 
 pack.island::
 	An extended regular expression configuring a set of delta
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 684698f679..5d3c42035b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -232,6 +232,7 @@ static int use_bitmap_index = -1;
 static enum {
 	NO_PACK_REUSE = 0,
 	SINGLE_PACK_REUSE,
+	MULTI_PACK_REUSE,
 } allow_pack_reuse = SINGLE_PACK_REUSE;
 static enum {
 	WRITE_BITMAP_FALSE = 0,
@@ -3251,6 +3252,8 @@ static int git_pack_config(const char *k, const char *v,
 		if (res < 0) {
 			if (!strcasecmp(v, "single"))
 				allow_pack_reuse = SINGLE_PACK_REUSE;
+			else if (!strcasecmp(v, "multi"))
+				allow_pack_reuse = MULTI_PACK_REUSE;
 			else
 				die(_("invalid pack.allowPackReuse value: '%s'"), v);
 		} else if (res) {
@@ -4029,7 +4032,8 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 		reuse_partial_packfile_from_bitmap(bitmap_git,
 						   &reuse_packfiles,
 						   &reuse_packfiles_nr,
-						   &reuse_packfile_bitmap);
+						   &reuse_packfile_bitmap,
+						   allow_pack_reuse == MULTI_PACK_REUSE);
 
 	if (reuse_packfiles) {
 		reuse_packfile_objects = bitmap_popcount(reuse_packfile_bitmap);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 242a5908f7..229a11fb00 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2040,7 +2040,8 @@ static int bitmapped_pack_cmp(const void *va, const void *vb)
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 					struct bitmapped_pack **packs_out,
 					size_t *packs_nr_out,
-					struct bitmap **reuse_out)
+					struct bitmap **reuse_out,
+					int multi_pack_reuse)
 {
 	struct repository *r = the_repository;
 	struct bitmapped_pack *packs = NULL;
@@ -2064,15 +2065,30 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				free(packs);
 				return;
 			}
+
 			if (!pack.bitmap_nr)
-				continue; /* no objects from this pack */
-			if (pack.bitmap_pos)
-				continue; /* not preferred pack */
+				continue;
+
+			if (!multi_pack_reuse && pack.bitmap_pos) {
+				/*
+				 * If we're only reusing a single pack, skip
+				 * over any packs which are not positioned at
+				 * the beginning of the MIDX bitmap.
+				 *
+				 * This is consistent with the existing
+				 * single-pack reuse behavior, which only reuses
+				 * parts of the MIDX's preferred pack.
+				 */
+				continue;
+			}
 
 			ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
 			memcpy(&packs[packs_nr++], &pack, sizeof(pack));
 
 			objects_nr += pack.p->num_objects;
+
+			if (!multi_pack_reuse)
+				break;
 		}
 
 		QSORT(packs, packs_nr, bitmapped_pack_cmp);
@@ -2080,10 +2096,10 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 		ALLOC_GROW(packs, packs_nr + 1, packs_alloc);
 
 		packs[packs_nr].p = bitmap_git->pack;
-		packs[packs_nr].bitmap_pos = 0;
 		packs[packs_nr].bitmap_nr = bitmap_git->pack->num_objects;
+		packs[packs_nr].bitmap_pos = 0;
 
-		objects_nr = packs[packs_nr++].p->num_objects;
+		objects_nr = packs[packs_nr++].bitmap_nr;
 	}
 
 	word_alloc = objects_nr / BITS_IN_EWORD;
@@ -2091,10 +2107,8 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 		word_alloc++;
 	reuse = bitmap_word_alloc(word_alloc);
 
-	if (packs_nr != 1)
-		BUG("pack reuse not yet implemented for multiple packs");
-
-	reuse_partial_packfile_from_bitmap_1(bitmap_git, packs, reuse);
+	for (i = 0; i < packs_nr; i++)
+		reuse_partial_packfile_from_bitmap_1(bitmap_git, &packs[i], reuse);
 
 	if (bitmap_is_empty(reuse)) {
 		free(packs);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 179b343912..c7dea13217 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -80,7 +80,8 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 					struct bitmapped_pack **packs_out,
 					size_t *packs_nr_out,
-					struct bitmap **reuse_out);
+					struct bitmap **reuse_out,
+					int multi_pack_reuse);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
new file mode 100755
index 0000000000..2ba788b042
--- /dev/null
+++ b/t/t5332-multi-pack-reuse.sh
@@ -0,0 +1,203 @@
+#!/bin/sh
+
+test_description='pack-objects multi-pack reuse'
+
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-bitmap.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+test_pack_reused () {
+	test_trace2_data pack-objects pack-reused "$1"
+}
+
+test_packs_reused () {
+	test_trace2_data pack-objects packs-reused "$1"
+}
+
+
+# pack_position <object> </path/to/pack.idx
+pack_position () {
+	git show-index >objects &&
+	grep "$1" objects | cut -d" " -f1
+}
+
+test_expect_success 'preferred pack is reused for single-pack reuse' '
+	test_config pack.allowPackReuse single &&
+
+	for i in A B
+	do
+		test_commit "$i" &&
+		git repack -d || return 1
+	done &&
+
+	git multi-pack-index write --bitmap &&
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --revs --all >/dev/null &&
+
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'enable multi-pack reuse' '
+	git config pack.allowPackReuse multi
+'
+
+test_expect_success 'reuse all objects from subset of bitmapped packs' '
+	test_commit C &&
+	git repack -d &&
+
+	git multi-pack-index write --bitmap &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse C)
+	^$(git rev-parse A)
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --revs <in >/dev/null &&
+
+	test_pack_reused 6 <trace2.txt &&
+	test_packs_reused 2 <trace2.txt
+'
+
+test_expect_success 'reuse all objects from all packs' '
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --revs --all >/dev/null &&
+
+	test_pack_reused 9 <trace2.txt &&
+	test_packs_reused 3 <trace2.txt
+'
+
+test_expect_success 'reuse objects from first pack with middle gap' '
+	for i in D E F
+	do
+		test_commit "$i" || return 1
+	done &&
+
+	# Set "pack.window" to zero to ensure that we do not create any
+	# deltas, which could alter the amount of pack reuse we perform
+	# (if, e.g., we are not sending one or more bases).
+	D="$(git -c pack.window=0 pack-objects --all --unpacked $packdir/pack)" &&
+
+	d_pos="$(pack_position $(git rev-parse D) <$packdir/pack-$D.idx)" &&
+	e_pos="$(pack_position $(git rev-parse E) <$packdir/pack-$D.idx)" &&
+	f_pos="$(pack_position $(git rev-parse F) <$packdir/pack-$D.idx)" &&
+
+	# commits F, E, and D, should appear in that order at the
+	# beginning of the pack
+	test $f_pos -lt $e_pos &&
+	test $e_pos -lt $d_pos &&
+
+	# Ensure that the pack we are constructing sorts ahead of any
+	# other packs in lexical/bitmap order by choosing it as the
+	# preferred pack.
+	git multi-pack-index write --bitmap --preferred-pack="pack-$D.idx" &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse E)
+	^$(git rev-parse D)
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
+
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'reuse objects from middle pack with middle gap' '
+	rm -fr $packdir/multi-pack-index* &&
+
+	# Ensure that the pack we are constructing sorts into any
+	# position *but* the first one, by choosing a different pack as
+	# the preferred one.
+	git multi-pack-index write --bitmap --preferred-pack="pack-$A.idx" &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse E)
+	^$(git rev-parse D)
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
+
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'omit delta with uninteresting base (same pack)' '
+	git repack -adk &&
+
+	test_seq 32 >f &&
+	git add f &&
+	test_tick &&
+	git commit -m "delta" &&
+	delta="$(git rev-parse HEAD)" &&
+
+	test_seq 64 >f &&
+	test_tick &&
+	git commit -a -m "base" &&
+	base="$(git rev-parse HEAD)" &&
+
+	test_commit other &&
+
+	git repack -d &&
+
+	have_delta "$(git rev-parse $delta:f)" "$(git rev-parse $base:f)" &&
+
+	git multi-pack-index write --bitmap &&
+
+	cat >in <<-EOF &&
+	$(git rev-parse other)
+	^$base
+	EOF
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --revs <in >/dev/null &&
+
+	# We can only reuse the 3 objects corresponding to "other" from
+	# the latest pack.
+	#
+	# This is because even though we want "delta", we do not want
+	# "base", meaning that we have to inflate the delta/base-pair
+	# corresponding to the blob in commit "delta", which bypasses
+	# the pack-reuse mechanism.
+	#
+	# The remaining objects from the other pack are similarly not
+	# reused because their objects are on the uninteresting side of
+	# the query.
+	test_pack_reused 3 <trace2.txt &&
+	test_packs_reused 1 <trace2.txt
+'
+
+test_expect_success 'omit delta from uninteresting base (cross pack)' '
+	cat >in <<-EOF &&
+	$(git rev-parse $base)
+	^$(git rev-parse $delta)
+	EOF
+
+	P="$(git pack-objects --revs $packdir/pack <in)" &&
+
+	git multi-pack-index write --bitmap --preferred-pack="pack-$P.idx" &&
+
+	: >trace2.txt &&
+	GIT_TRACE2_EVENT="$PWD/trace2.txt" \
+		git pack-objects --stdout --delta-base-offset --all >/dev/null &&
+
+	packs_nr="$(find $packdir -type f -name "pack-*.pack" | wc -l)" &&
+	objects_nr="$(git rev-list --count --all --objects)" &&
+
+	test_pack_reused $(($objects_nr - 1)) <trace2.txt &&
+	test_packs_reused $packs_nr <trace2.txt
+'
+
+test_done
-- 
2.43.0.102.ga31d690331.dirty
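
The pack-selection loop in reuse_partial_packfile_from_bitmap() above can be sketched outside of C as follows. This is an illustration only: `select_reuse_packs` and the "<name> <bitmap_pos> <bitmap_nr>" input format are assumptions, standing in for the `bitmapped_pack` entries read from the MIDX.

```shell
# select_reuse_packs <multi>: reads "<name> <bitmap_pos> <bitmap_nr>"
# lines on stdin (a hypothetical text rendering of the MIDX's packs)
# and prints the packs that would be marked as reuse candidates.
select_reuse_packs () {
	awk -v multi="$1" '
		$3 == 0 { next }           # pack has no objects in the bitmap
		!multi && $2 != 0 { next } # single-pack: only the pack at bit 0
		{ print $1 }
		!multi { exit }            # single-pack: stop after the first hit
	'
}

packs='pack-b 3 2
pack-a 0 3
pack-c 5 0
pack-d 5 4'

# Single-pack reuse: prints only pack-a (the pack at bit position 0).
printf '%s\n' "$packs" | select_reuse_packs 0

# Multi-pack reuse: prints pack-b, pack-a and pack-d; pack-c is
# skipped because it contributes no objects to the bitmap.
printf '%s\n' "$packs" | select_reuse_packs 1
```

The sketch prints candidates in input order, whereas the patch sorts them afterwards with QSORT() and bitmapped_pack_cmp().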


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH v2 26/26] t/perf: add performance tests for multi-pack reuse
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (24 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 25/26] pack-bitmap: enable reuse from all bitmapped packs Taylor Blau
@ 2023-12-14 22:24   ` Taylor Blau
  2023-12-15  0:06   ` [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse Junio C Hamano
  26 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-14 22:24 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Patrick Steinhardt, Junio C Hamano

To ensure that we don't regress either the size or runtime performance
of multi-pack reuse, add a performance test to measure both of these.

The test partitions the objects in GIT_TEST_PERF_LARGE_REPO into 1, 10,
and 100 packs, and then tries to perform a "clone" at each stage with
both single- and multi-pack reuse enabled.

Note that the `repack_into_n_chunks()` function in this new test script
differs from the existing `repack_into_n()`. The former partitions the
repository into N equal-sized chunks, while the latter produces N packs
of five commits each (plus their objects), and then another pack with
the remainder.

On git.git, I can produce the following results on my machine:

    Test                                                            this tree
    --------------------------------------------------------------------------------
    5332.3: clone for 1-pack scenario (single-pack reuse)           1.57(2.99+0.15)
    5332.4: clone size for 1-pack scenario (single-pack reuse)               231.8M
    5332.5: clone for 1-pack scenario (multi-pack reuse)            1.79(2.96+0.21)
    5332.6: clone size for 1-pack scenario (multi-pack reuse)                231.7M
    5332.9: clone for 10-pack scenario (single-pack reuse)          3.89(16.75+0.35)
    5332.10: clone size for 10-pack scenario (single-pack reuse)             209.9M
    5332.11: clone for 10-pack scenario (multi-pack reuse)          1.56(2.99+0.17)
    5332.12: clone size for 10-pack scenario (multi-pack reuse)              224.4M
    5332.15: clone for 100-pack scenario (single-pack reuse)        8.24(54.31+0.59)
    5332.16: clone size for 100-pack scenario (single-pack reuse)            278.3M
    5332.17: clone for 100-pack scenario (multi-pack reuse)         2.13(2.44+0.33)
    5332.18: clone size for 100-pack scenario (multi-pack reuse)             357.9M

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5332-multi-pack-reuse.sh | 81 ++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100755 t/perf/p5332-multi-pack-reuse.sh

diff --git a/t/perf/p5332-multi-pack-reuse.sh b/t/perf/p5332-multi-pack-reuse.sh
new file mode 100755
index 0000000000..5c6c575d62
--- /dev/null
+++ b/t/perf/p5332-multi-pack-reuse.sh
@@ -0,0 +1,81 @@
+#!/bin/sh
+
+test_description='tests pack performance with multi-pack reuse'
+
+. ./perf-lib.sh
+. "${TEST_DIRECTORY}/perf/lib-pack.sh"
+
+packdir=.git/objects/pack
+
+test_perf_large_repo
+
+find_pack () {
+	for idx in $packdir/pack-*.idx
+	do
+		if git show-index <$idx | grep -q "$1"
+		then
+			basename $idx
+		fi || return 1
+	done
+}
+
+repack_into_n_chunks () {
+	git repack -adk &&
+
+	test "$1" -eq 1 && return ||
+
+	find $packdir -type f | sort >packs.before &&
+
+	# partition the repository into $1 chunks of consecutive commits, and
+	# then create $1 packs with the objects reachable from each chunk
+	# (excluding any objects reachable from the previous chunks)
+	sz="$(($(git rev-list --count --all) / $1))"
+	for rev in $(git rev-list --all | awk "NR % $sz == 0" | tac)
+	do
+		pack="$(echo "$rev" | git pack-objects --revs \
+			--honor-pack-keep --delta-base-offset $packdir/pack)" &&
+		touch $packdir/pack-$pack.keep || return 1
+	done
+
+	# grab any remaining objects not packed by the previous step(s)
+	git pack-objects --revs --all --honor-pack-keep --delta-base-offset \
+		$packdir/pack &&
+
+	find $packdir -type f | sort >packs.after &&
+
+	# and install the whole thing
+	for f in $(comm -12 packs.before packs.after)
+	do
+		rm -f "$f" || return 1
+	done
+	rm -fr $packdir/*.keep
+}
+
+for nr_packs in 1 10 100
+do
+	test_expect_success "create $nr_packs-pack scenario" '
+		repack_into_n_chunks $nr_packs
+	'
+
+	test_expect_success "setup bitmaps for $nr_packs-pack scenario" '
+		find $packdir -type f -name "*.idx" | sed -e "s/.*\/\(.*\)$/+\1/g" |
+		git multi-pack-index write --stdin-packs --bitmap \
+			--preferred-pack="$(find_pack $(git rev-parse HEAD))"
+	'
+
+	for reuse in single multi
+	do
+		test_perf "clone for $nr_packs-pack scenario ($reuse-pack reuse)" "
+			git for-each-ref --format='%(objectname)' refs/heads refs/tags >in &&
+			git -c pack.allowPackReuse=$reuse pack-objects \
+				--revs --delta-base-offset --use-bitmap-index \
+				--stdout <in >result
+		"
+
+		test_size "clone size for $nr_packs-pack scenario ($reuse-pack reuse)" '
+			wc -c <result
+		'
+	done
+done
+
+test_done
-- 
2.43.0.102.ga31d690331.dirty
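
To make the partitioning arithmetic in repack_into_n_chunks() concrete, here is a small sketch that swaps `git rev-list --all` for `seq`, so that line N simply reads "N". The `chunk_boundaries` helper is hypothetical and exists only for this illustration.

```shell
# chunk_boundaries <total> <chunks>: prints the 1-based positions that
# would be picked as chunk tips, in the order the loop visits them.
chunk_boundaries () {
	sz=$(($1 / $2))
	seq 1 "$1" | awk "NR % $sz == 0" | tac
}

# 20 "commits" split into 4 chunks: every 5th line is a tip, and tac
# reverses the list so the last position is visited first.
chunk_boundaries 20 4
```

This prints 20, 15, 10 and 5, one per line. Since rev-list lists newer commits first, the loop in the patch starts from the oldest tip. In a linear history, commits newer than the newest tip are left for the final `pack-objects --all` step. Note that if the commit count is smaller than the requested number of chunks, `sz` is 0 and the `NR % $sz` test is no longer meaningful.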

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
  2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
                     ` (25 preceding siblings ...)
  2023-12-14 22:24   ` [PATCH v2 26/26] t/perf: add performance tests for multi-pack reuse Taylor Blau
@ 2023-12-15  0:06   ` Junio C Hamano
  2023-12-15  0:38     ` Taylor Blau
  2023-12-15  0:40     ` Junio C Hamano
  26 siblings, 2 replies; 107+ messages in thread
From: Junio C Hamano @ 2023-12-15  0:06 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Patrick Steinhardt

I haven't looked into the details yet, but it seems that
t5309-pack-delta-cycles.sh fails under

    $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
  2023-12-15  0:06   ` [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse Junio C Hamano
@ 2023-12-15  0:38     ` Taylor Blau
  2023-12-15  0:40     ` Junio C Hamano
  1 sibling, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2023-12-15  0:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Patrick Steinhardt

On Thu, Dec 14, 2023 at 04:06:49PM -0800, Junio C Hamano wrote:
> I haven't looked into the details yet, but it seems that
> t5309-pack-delta-cycles.sh fails under
>
>     $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test

Hrm. I tried to reproduce this, but I'm not seeing it. I have:

$ make SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true test
[ ... ]
All tests successful.
Files=1001, Tests=14558, 48 wallclock secs ( 8.34 usr  2.10 sys + 391.60 cusr 319.67 csys = 721.71 CPU)
Result: PASS

With this series applied on top of 1a87c842ec (Start the 2.44 cycle,
2023-12-09). The tree I get at the end is d148e16f5cfba405a9823cb68540a8c83004f98f.

Did we apply onto different bases?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
  2023-12-15  0:06   ` [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse Junio C Hamano
  2023-12-15  0:38     ` Taylor Blau
@ 2023-12-15  0:40     ` Junio C Hamano
  2023-12-15  1:25       ` Taylor Blau
  1 sibling, 1 reply; 107+ messages in thread
From: Junio C Hamano @ 2023-12-15  0:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Patrick Steinhardt

Junio C Hamano <gitster@pobox.com> writes:

> I haven't looked into the details yet, but it seems that
> t5309-pack-delta-cycles.sh fails under
>
>     $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test

Hmph, this seems to be elusive.  I tried it again and then it did
not fail this time.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
  2023-12-15  0:40     ` Junio C Hamano
@ 2023-12-15  1:25       ` Taylor Blau
  2023-12-21 10:51         ` Jeff King
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-12-15  1:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Patrick Steinhardt

On Thu, Dec 14, 2023 at 04:40:40PM -0800, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> > I haven't looked into the details yet, but it seems that
> > t5309-pack-delta-cycles.sh fails under
> >
> >     $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test
>
> Hmph, this seems to be elusive.  I tried it again and then it did
> not fail this time.

Indeed, but I was able to reproduce the failure both on my branch and on
'master' under --stress, yielding the following failure in t5309.6:

    + git index-pack --fix-thin --stdin
    fatal: REF_DELTA at offset 46 already resolved (duplicate base 01d7713666f4de822776c7622c10f1b07de280dc?)

    =================================================================
    ==3904583==ERROR: LeakSanitizer: detected memory leaks

    Direct leak of 32 byte(s) in 1 object(s) allocated from:
        #0 0x7fa790d01986 in __interceptor_realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
        #1 0x7fa790add769 in __pthread_getattr_np nptl/pthread_getattr_np.c:180
        #2 0x7fa790d117c5 in __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:150
        #3 0x7fa790d11957 in __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:598
        #4 0x7fa790d03fe8 in __lsan::ThreadStart(unsigned int, unsigned long long, __sanitizer::ThreadType) ../../../../src/libsanitizer/lsan/lsan_posix.cpp:51
        #5 0x7fa790d013fd in __lsan_thread_start_func ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:440
        #6 0x7fa790adc3eb in start_thread nptl/pthread_create.c:444
        #7 0x7fa790b5ca5b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

    SUMMARY: LeakSanitizer: 32 byte(s) leaked in 1 allocation(s).
    Aborted

The duplicate base thing is a red-herring, and is an expected result of
the test. But the leak is definitely not, and I'm not sure what's going
on here since the frames listed above are in the LSan runtime, not Git.

I'll try to dig into this a bit more, but I'm not quite sure what's
going on yet.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
  2023-12-12  8:12 ` Jeff King
@ 2023-12-15 15:37   ` Taylor Blau
  2023-12-21 11:13     ` Jeff King
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2023-12-15 15:37 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Patrick Steinhardt, Junio C Hamano

On Tue, Dec 12, 2023 at 03:12:38AM -0500, Jeff King wrote:
> So my question is: how much of what you're seeing is from (1) and (2),
> and how much is from (3)? Because there are other ways to trigger (3),
> such as lowering the window size. For example, if you try your same
> packing example with --window=0, how do the CPU and output size compare
> to the results of your series? (I'd also check peak memory usage).

Interesting question! Here are some preliminary numbers on my machine
(which runs Debian unstable with an Intel Xeon W-2255 CPU @ 3.70GHz and
64GB of RAM).

I ran the following hyperfine command on my testing repository, which
has the Git repository broken up into ~75 packs or so:

    $ hyperfine -L v single,multi -L window 0,10 \
      --show-output \
      -n '{v}-pack reuse, pack.window={window}' \
      'git.compile \
        -c pack.allowPackReuse={v} \
        -c pack.window={window} \
        pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in 2>/dev/null | wc -c'

Which gave the following results for runtime:

    Benchmark 1: single-pack reuse, pack.window=0
    [...]
      Time (mean ± σ):      1.248 s ±  0.004 s    [User: 1.160 s, System: 0.188 s]
      Range (min … max):    1.244 s …  1.259 s    10 runs

    Benchmark 2: multi-pack reuse, pack.window=0
    [...]
      Time (mean ± σ):      1.075 s ±  0.005 s    [User: 0.990 s, System: 0.188 s]
      Range (min … max):    1.071 s …  1.088 s    10 runs

    Benchmark 3: single-pack reuse, pack.window=10
    [...]
      Time (mean ± σ):      6.281 s ±  0.024 s    [User: 43.727 s, System: 0.492 s]
      Range (min … max):    6.252 s …  6.326 s    10 runs

    Benchmark 4: multi-pack reuse, pack.window=10
    [...]
      Time (mean ± σ):      1.028 s ±  0.002 s    [User: 1.150 s, System: 0.184 s]
      Range (min … max):    1.026 s …  1.032 s    10 runs

Here are the average numbers for the resulting "clone" size in each of
the above configurations:

    Benchmark 1: single-pack reuse, pack.window=0
    264.443 MB
    Benchmark 2: multi-pack reuse, pack.window=0
    268.670 MB
    Benchmark 3: single-pack reuse, pack.window=10
    194.355 MB
    Benchmark 4: multi-pack reuse, pack.window=10
    266.473 MB

So it looks like setting pack.window=0 (with both single and multi-pack
reuse) yields pack output of a similar size to multi-pack reuse with any
window setting.

Running the same benchmark as above again, but this time sending the
pack output to /dev/null and instead capturing the maximum RSS value
from `/usr/bin/time -v` gives us the following (averages, in MB):

    Benchmark 1: single-pack reuse, pack.window=0
    354.224 MB (max RSS)
    Benchmark 2: multi-pack reuse, pack.window=0
    315.730 MB (max RSS)
    Benchmark 3: single-pack reuse, pack.window=10
    470.651 MB (max RSS)
    Benchmark 4: multi-pack reuse, pack.window=10
    328.786 MB (max RSS)

So memory usage is similar between runs except for the single-pack reuse
case with a window size of 10.

It looks like the sweet spot is probably multi-pack reuse with a
small-ish window size, which achieves the best of both worlds (small
pack size, relative to other configurations that reuse large portions of
the pack, and low memory usage).

It's pretty close between multi-pack reuse with a window size of 0 and
a window size of 10. If you want to optimize for pack size, you could
trade a ~4% increase in peak memory usage for a ~1% reduction in pack
size.

Of course, YMMV depending on the repository, packing strategy, and
pack.window configuration (among others) while packing. But this should
give you a general idea of what to expect.

Thanks,
Taylor
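
The relative differences between the two multi-pack configurations can be recomputed from the averages quoted above. The sketch below assumes nothing beyond those four figures; `pct_change` is a hypothetical helper name.

```shell
# pct_change <before> <after>: relative change in percent, one decimal.
pct_change () {
	awk -v before="$1" -v after="$2" \
		'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

# Multi-pack reuse, pack.window=0 -> pack.window=10
pct_change 268.670 266.473   # output size in MB
pct_change 315.730 328.786   # peak RSS in MB
```

This prints -0.8 for the output size and 4.1 for peak RSS.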

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
  2023-12-15  1:25       ` Taylor Blau
@ 2023-12-21 10:51         ` Jeff King
  2024-01-04 22:24           ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Jeff King @ 2023-12-21 10:51 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, Patrick Steinhardt

On Thu, Dec 14, 2023 at 08:25:35PM -0500, Taylor Blau wrote:

> On Thu, Dec 14, 2023 at 04:40:40PM -0800, Junio C Hamano wrote:
> > Junio C Hamano <gitster@pobox.com> writes:
> >
> > > I haven't looked into the details yet, but it seems that
> > > t5309-pack-delta-cycles.sh fails under
> > >
> > >     $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make -j16 test
> >
> > Hmph, this seems to be elusive.  I tried it again and then it did
> > not fail this time.
> 
> Indeed, but I was able to reproduce the failure both on my branch and on
> 'master' under --stress, yielding the following failure in t5309.6:

OK, so it's nothing new, and we can ignore it for your series (I haven't
seen it in the wild yet, but it is something we may need to deal with in
the long run if it keeps popping up).

>     + git index-pack --fix-thin --stdin
>     fatal: REF_DELTA at offset 46 already resolved (duplicate base 01d7713666f4de822776c7622c10f1b07de280dc?)
> 
>     =================================================================
>     ==3904583==ERROR: LeakSanitizer: detected memory leaks
> 
>     Direct leak of 32 byte(s) in 1 object(s) allocated from:
>         #0 0x7fa790d01986 in __interceptor_realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
>         #1 0x7fa790add769 in __pthread_getattr_np nptl/pthread_getattr_np.c:180
>         #2 0x7fa790d117c5 in __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:150
>         #3 0x7fa790d11957 in __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:598
>         #4 0x7fa790d03fe8 in __lsan::ThreadStart(unsigned int, unsigned long long, __sanitizer::ThreadType) ../../../../src/libsanitizer/lsan/lsan_posix.cpp:51
>         #5 0x7fa790d013fd in __lsan_thread_start_func ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:440
>         #6 0x7fa790adc3eb in start_thread nptl/pthread_create.c:444
>         #7 0x7fa790b5ca5b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
> 
>     SUMMARY: LeakSanitizer: 32 byte(s) leaked in 1 allocation(s).
>     Aborted
> 
> The duplicate base thing is a red herring, and is an expected result of
> the test. But the leak is definitely not, and I'm not sure what's going
> on here since the frames listed above are in the LSan runtime, not Git.

I suspect this is a race in LSan caused by a thread calling exit() while
other threads are spawning. Here's my theory.

When a thread is spawned, LSan needs to know where its stack is (so it
can look for pointers to reachable memory). It calls pthread_getattr_np(),
which gets an attributes object that must be cleaned up with
pthread_attr_destroy(). Presumably it does this shortly after. But
there's a race window where that attr object is allocated and we haven't
yet set up the new thread's info. If another thread calls exit() then,
LSan will run but its book-keeping will be in an inconsistent state.
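
To make the race window concrete, here is a minimal sketch of the
pthread_getattr_np() / pthread_attr_destroy() pairing as seen from a
freshly started thread. This is not LSan's actual code: struct
stack_info, probe_stack(), and probe_new_thread_stack() are made-up
names for illustration, and the sketch assumes glibc on Linux.

```c
#define _GNU_SOURCE /* pthread_getattr_np() is a GNU extension */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of what the new thread learned about its stack. */
struct stack_info {
	void *addr;         /* lowest address of the stack */
	size_t size;        /* stack size in bytes */
	int contains_local; /* 1 if a local variable lies inside the range */
	int err;            /* 0 on success, else a pthread error code */
};

static void *probe_stack(void *arg)
{
	struct stack_info *info = arg;
	pthread_attr_t attr;
	char local;

	assert(info != NULL);

	/* Fills in attr; it may own resources until it is destroyed. */
	info->err = pthread_getattr_np(pthread_self(), &attr);
	if (info->err)
		return NULL;

	/*
	 * Between the call above and pthread_attr_destroy() below,
	 * attr is live. This is the window described in the text.
	 */
	info->err = pthread_attr_getstack(&attr, &info->addr, &info->size);
	pthread_attr_destroy(&attr);

	if (!info->err) {
		uintptr_t lo = (uintptr_t)info->addr;
		uintptr_t p = (uintptr_t)&local;

		info->contains_local = p >= lo && p < lo + info->size;
	}
	return NULL;
}

/* Spawn one thread, let it probe its own stack, and wait for it. */
static int probe_new_thread_stack(struct stack_info *info)
{
	pthread_t thread;
	int ret;

	info->addr = NULL;
	info->size = 0;
	info->contains_local = 0;
	info->err = 0;

	ret = pthread_create(&thread, NULL, probe_stack, info);
	if (ret)
		return ret;
	ret = pthread_join(thread, NULL);
	return ret ? ret : info->err;
}
```

Anything pthread_getattr_np() allocated inside attr is only released by
pthread_attr_destroy(), so a leak check that happens to run between the
two calls could plausibly report it.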

One way to work around that would be to make the creation of the threads
atomic. That is, create each in a suspended state, and only let them run
once they are all created. There's no option in pthreads to do this, but
we can simulate it by having them block on a mutex before starting. And
indeed, we already have such a lock: the work_lock() that they all use
to get data to process.

After applying this patch:

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index dda94a9f46..0e94819216 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1257,13 +1257,15 @@ static void resolve_deltas(void)
 	base_cache_limit = delta_base_cache_limit * nr_threads;
 	if (nr_threads > 1 || getenv("GIT_FORCE_THREADS")) {
 		init_thread();
+		work_lock();
 		for (i = 0; i < nr_threads; i++) {
 			int ret = pthread_create(&thread_data[i].thread, NULL,
 						 threaded_second_pass, thread_data + i);
 			if (ret)
 				die(_("unable to create thread: %s"),
 				    strerror(ret));
 		}
+		work_unlock();
 		for (i = 0; i < nr_threads; i++)
 			pthread_join(thread_data[i].thread, NULL);
 		cleanup_thread();

I ran t5309 with "--stress --run=6" for about 5 minutes with no failures
(whereas without the patch, I usually see a failure in 10 seconds or
so).
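
The start-gate idea from the patch can also be shown in isolation. Below
is a minimal, self-contained sketch, assuming a plain pthread mutex
stands in for index-pack's work_lock()/work_unlock(). The names gate,
worker(), and spawn_gated() are hypothetical and not part of Git.

```c
#include <pthread.h>

#define NR_WORKERS 4

static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;
static int started; /* protected by gate */

static void *worker(void *arg)
{
	(void)arg;
	/* Blocks here until the spawning thread releases the gate. */
	pthread_mutex_lock(&gate);
	started++;
	pthread_mutex_unlock(&gate);
	return NULL;
}

/*
 * Create all workers while holding the gate, so none of them can do
 * any work (or exit the program) until every thread exists.
 *
 * Returns the number of workers that got past the gate while it was
 * still held (expected to be 0), or -1 if a thread could not be
 * created. The final count is stored in *final_count.
 */
static int spawn_gated(int *final_count)
{
	pthread_t threads[NR_WORKERS];
	int i, created, seen_while_held;

	started = 0;
	pthread_mutex_lock(&gate);
	for (created = 0; created < NR_WORKERS; created++) {
		if (pthread_create(&threads[created], NULL, worker, NULL))
			break;
	}
	/* We still hold the gate, so reading the counter is safe. */
	seen_while_held = started;
	pthread_mutex_unlock(&gate);

	for (i = 0; i < created; i++)
		pthread_join(threads[i], NULL);

	*final_count = started;
	return created == NR_WORKERS ? seen_while_held : -1;
}
```

Because every worker must take the gate before touching shared state,
the counter is still zero when the creation loop finishes, and reaches
NR_WORKERS only after the gate is released and the threads are joined.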

So it's a pretty easy fix, though I don't love it in general. Every
place that spawns multiple threads that can die() would need the same
treatment. And this isn't a "real" leak in any reasonable sense; it only
happens because we're exiting the program directly, at which point all
of the memory is returned to the OS anyway. So I hate changing
production code to satisfy a leak-checking false positive.

OTOH, dealing with false positives is annoying for humans, and the
run-time cost should be negligible. We can work around this one, and
avoid making the same change in other spots unless somebody sees a racy
failure in practice.

-Peff


* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
  2023-12-15 15:37   ` Taylor Blau
@ 2023-12-21 11:13     ` Jeff King
  2024-01-04 22:22       ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Jeff King @ 2023-12-21 11:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Patrick Steinhardt, Junio C Hamano

On Fri, Dec 15, 2023 at 10:37:57AM -0500, Taylor Blau wrote:

> On Tue, Dec 12, 2023 at 03:12:38AM -0500, Jeff King wrote:
> > So my question is: how much of what you're seeing is from (1) and (2),
> > and how much is from (3)? Because there are other ways to trigger (3),
> > such as lowering the window size. For example, if you try your same
> > packing example with --window=0, how do the CPU and output size compare
> > to the results of your series? (I'd also check peak memory usage).
> 
> Interesting question! Here are some preliminary numbers on my machine
> (which runs Debian unstable with an Intel Xeon W-2255 CPU @ 3.70GHz and
> 64GB of RAM).
> 
> I ran the following hyperfine command on my testing repository, which
> has the Git repository broken up into ~75 packs or so:

Thanks for running these tests. The results are similar to what I
expected, which is: yes, most of your CPU savings are from skipping
deltas, but not all.

Here's what I see (which I think is mostly redundant with what you've
said, but I just want to lay out my line of thinking). I'll reorder your
quoted sections a bit as I go:

>     Benchmark 2: multi-pack reuse, pack.window=0
>     [...]
>       Time (mean ± σ):      1.075 s ±  0.005 s    [User: 0.990 s, System: 0.188 s]
>       Range (min … max):    1.071 s …  1.088 s    10 runs
>
>     Benchmark 4: multi-pack reuse, pack.window=10
>     [...]
>       Time (mean ± σ):      1.028 s ±  0.002 s    [User: 1.150 s, System: 0.184 s]
>       Range (min … max):    1.026 s …  1.032 s    10 runs

OK, so when we're doing more full ("multi") reuse, the pack window
doesn't make a big difference either way. You didn't show the stderr
from each, but presumably most of the objects are hitting the "reuse"
path, and only a few are deltas. That is backed up by the fact that
doing deltas only gives us a slight improvement in the output size:

>     Benchmark 2: multi-pack reuse, pack.window=0
>     268.670 MB
>     Benchmark 4: multi-pack reuse, pack.window=10
>     266.473 MB

Comparing the runs with less reuse:

>     Benchmark 1: single-pack reuse, pack.window=0
>     [...]
>       Time (mean ± σ):      1.248 s ±  0.004 s    [User: 1.160 s, System: 0.188 s]
>       Range (min … max):    1.244 s …  1.259 s    10 runs
>
>     Benchmark 3: single-pack reuse, pack.window=10
>     [...]
>       Time (mean ± σ):      6.281 s ±  0.024 s    [User: 43.727 s, System: 0.492 s]
>       Range (min … max):    6.252 s …  6.326 s    10 runs

there obviously is a huge amount of time saved by not doing deltas, but
we pay for it with a much bigger pack:

>     Benchmark 1: single-pack reuse, pack.window=0
>     264.443 MB
>     Benchmark 3: single-pack reuse, pack.window=10
>     194.355 MB

But of course that "much bigger" pack is about the same size as the one
we get from doing multi-pack reuse. Which is not surprising, because
both are avoiding looking for new deltas (and the packs after the
preferred one probably have mediocre deltas).

So I do actually think that disabling pack.window gives you a
similar-ish tradeoff to expanding the pack-reuse code (~6s down to ~1s,
and a 36% embiggening of the resulting pack size).

Which implies that one option is to scrap your entire series and just
set pack.window to 0. That is basically comparing multi/10 (your
patches) to single/0 (a hypothetical config option), which have similar
run-times and pack sizes.

But that's not quite the whole story. There is still a CPU improvement
in your series (1.2s vs 1.0s, a 20% speedup). And as I'd expect, a
memory improvement from avoiding the extra book-keeping (almost 10%):

>     Benchmark 1: single-pack reuse, pack.window=0
>     354.224 MB (max RSS)
>     Benchmark 4: multi-pack reuse, pack.window=10
>     328.786 MB (max RSS)
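
For reference, the percentages in this message can be recomputed from
the quoted benchmark figures. The following is a small sketch:
pct_change() and print_tradeoffs() are made-up helpers, and the inputs
are copied from the timing, pack size, and max RSS output quoted above.

```c
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

static void print_tradeoffs(void)
{
	/* single-pack reuse: pack.window=10 versus pack.window=0 */
	printf("pack size, single/10 -> single/0: %+.1f%%\n",
	       pct_change(194.355, 264.443));
	printf("wall time, single/10 -> single/0: %.1fx faster\n",
	       6.281 / 1.248);

	/* single-pack reuse, window=0 versus multi-pack reuse, window=10 */
	printf("wall time, single/0 -> multi/10: %+.1f%%\n",
	       pct_change(1.248, 1.028));
	printf("max RSS,   single/0 -> multi/10: %+.1f%%\n",
	       pct_change(354.224, 328.786));
}
```

With these inputs, dropping the window grows the pack by about 36% and
cuts wall time roughly 5x, while going from single/0 to multi/10 saves
roughly 18% of wall time (about a 1.2x speedup) and about 7% of peak
RSS.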

So while it's a lot less code to just set the window size, I do think
those improvements are worth it. And really, it's the same tradeoff we
make for the single-pack case (i.e., one could argue that we
could/should rip out the verbatim-reuse code entirely in favor of just
tweaking the window size).

> It's pretty close between multi-pack reuse with a window size of 0 and
> a window size of 10. If you want to optimize for pack size, you could
> trade a ~4% reduction in pack size for a ~1% increase in peak memory
> usage.

I think if you want to optimize for pack size, you should consider
repacking all-into-one to get better on-disk deltas. ;) I know that's
easier said than done when the I/O costs are significant. I do wonder if
storing thin packs on disk would let us more cheaply reach a state that
could serve optimal-ish packs without spending CPU computing bespoke
deltas for each client. But that's a much larger topic.

-Peff


* Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
  2023-12-21 11:13     ` Jeff King
@ 2024-01-04 22:22       ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2024-01-04 22:22 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Patrick Steinhardt, Junio C Hamano

On Thu, Dec 21, 2023 at 06:13:33AM -0500, Jeff King wrote:
> But that's not quite the whole story. There is still a CPU improvement
> in your series (1.2s vs 1.0s, a 20% speedup). And as I'd expect, a
> memory improvement from avoiding the extra book-keeping (almost 10%):
>
> >     Benchmark 1: single-pack reuse, pack.window=0
> >     354.224 MB (max RSS)
> >     Benchmark 4: multi-pack reuse, pack.window=10
> >     328.786 MB (max RSS)

I agree. And I expect that we'd see larger savings on larger, real-world
repositories (the numbers here are generated from a semi out-of-date
copy of git.git).

> So while it's a lot less code to just set the window size, I do think
> those improvements are worth it. And really, it's the same tradeoff we
> make for the single-pack case (i.e., one could argue that we
> could/should rip out the verbatim-reuse code entirely in favor of just
> tweaking the window size).

Definitely an interesting direction. One question that comes to mind is
whether or not we'd want to keep the "verbatim" reuse code around to
avoid adding objects to the packing list directly. That's a
non-negligible memory savings when generating large packs where we can
reuse large swaths of existing packs.

That seems like a topic for another day, though ;-). Not quite
left-over-bits, so maybe... #leftoverboulders?

Thanks,
Taylor


* Re: [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse
  2023-12-21 10:51         ` Jeff King
@ 2024-01-04 22:24           ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2024-01-04 22:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Patrick Steinhardt

On Thu, Dec 21, 2023 at 05:51:24AM -0500, Jeff King wrote:
> I suspect this is a race in LSan caused by a thread calling exit() while
> other threads are spawning. Here's my theory.
>
> When a thread is spawned, LSan needs to know where its stack is (so it
> can look for pointers to reachable memory). It calls pthread_getattr_np(),
> which gets an attributes object that must be cleaned up with
> pthread_attr_destroy(). Presumably it does this shortly after. But
> there's a race window where that attr object is allocated and we haven't
> yet set up the new thread's info. If another thread calls exit() then,
> LSan will run but its book-keeping will be in an inconsistent state.

Thanks for digging. I agree with your theory, and am annoyed with how
clever it is ;-).

> So it's a pretty easy fix, though I don't love it in general. Every
> place that spawns multiple threads that can die() would need the same
> treatment. And this isn't a "real" leak in any reasonable sense; it only
> happens because we're exiting the program directly, at which point all
> of the memory is returned to the OS anyway. So I hate changing
> production code to satisfy a leak-checking false positive.
>
> OTOH, dealing with false positives is annoying for humans, and the
> run-time cost should be negligible. We can work around this one, and
> avoid making the same change in other spots unless somebody sees a racy
> failure in practice.

Yeah... I share your thoughts here as well. It's kind of gross that we
have to touch production code purely to appease the leak checker, but I
think that the trade-off is worth it since:

  - the false positives are annoying to diagnose (as you said, and as
    evidenced by the time that you, Junio, and I have sunk into
    discussing this ;-)).

  - the run-time cost is negligible.

So I think that this is a good change to make, and I'm happy to see it
go through. I don't think we should necessarily try too hard to find all
spots that might benefit from a similar change, and instead just apply
the same treatment if/when we notice false positives in CI.

Thanks,
Taylor

