Git development
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Elijah Newren <newren@gmail.com>,
	Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH 6/8] pack-bitmap: sort bitmaps before XORing
Date: Wed, 27 May 2026 12:56:00 -0400	[thread overview]
Message-ID: <ahciIDuESxNa9Fzn@nand.local> (raw)
In-Reply-To: <20260527100406.GG981444@coredump.intra.peff.net>

On Wed, May 27, 2026 at 06:04:06AM -0400, Jeff King wrote:
> On Tue, May 19, 2026 at 12:12:50PM -0400, Taylor Blau wrote:
>
> > Reachability bitmaps may be stored as XORs against nearby bitmaps, up to
> > 10 away. However, when callers provide selected commits in an arbitrary
> > order, the writer may miss good ancestor/descendant pairs and produce
> > much larger bitmap files without changing query coverage.
> >
> > Sort the selected bitmaps in date order (from oldest to newest) before
> > computing XOR offsets, leaving pseudo-merge bitmaps alone (which we will
> > deal with separately in following commits).
>
> That order certainly makes the most sense. I'd have thought we ended up
> there incidentally because of the order in which we consider the
> commits, but perhaps not. I wonder if this got much worse when we
> re-wrote the bitmap generation code a few years ago.
>
> That was in v2.31.0, I think. Repacking linux.git with bitmaps, though,
> I couldn't find any difference in size between v2.30 and v2.31. They're
> both ~67M. But that also didn't shrink with this patch, either.
>
> If you have some spare CPU cycles to burn, I would be interested in a
> comparison of the bitmap size of your test repo using v2.30.0, v2.31.1,
> and this patch.

I started running this experiment, but I don't think I actually have
enough CPU cycles to let it finish ;-). Pre-v2.31 bitmap generation is
*really* slow[^1], and after multiple hours (forcing the same selection
of bitmaps by back-porting and adjusting 'test-tool bitmap') I couldn't
seem to make any meaningful progress.

I'm sure that you could get some plausible numbers out of benchmarking
this on a smaller repository. In case you're interested, here's the
patch I wrote on top of v2.30.0:

--- 8< ---
diff --git a/Makefile b/Makefile
index 7b64106930a..9ce9f2b483c 100644
--- a/Makefile
+++ b/Makefile
@@ -690,6 +690,7 @@ X =
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))

 TEST_BUILTINS_OBJS += test-advise.o
+TEST_BUILTINS_OBJS += test-bitmap.o
 TEST_BUILTINS_OBJS += test-bloom.o
 TEST_BUILTINS_OBJS += test-chmtime.o
 TEST_BUILTINS_OBJS += test-config.o
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 5e998bdaa79..55dbb475120 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -113,7 +113,7 @@ void bitmap_writer_build_type_index(struct packing_data *to_pack,
 static struct object **seen_objects;
 static unsigned int seen_objects_nr, seen_objects_alloc;

-static inline void push_bitmapped_commit(struct commit *commit, struct ewah_bitmap *reused)
+void bitmap_writer_push_commit(struct commit *commit, struct ewah_bitmap *reused)
 {
 	if (writer.selected_nr >= writer.selected_alloc) {
 		writer.selected_alloc = (writer.selected_alloc + 32) * 2;
@@ -402,7 +402,7 @@ void bitmap_writer_select_commits(struct commit **indexed_commits,

 	if (indexed_commits_nr < 100) {
 		for (i = 0; i < indexed_commits_nr; ++i)
-			push_bitmapped_commit(indexed_commits[i], NULL);
+			bitmap_writer_push_commit(indexed_commits[i], NULL);
 		return;
 	}

@@ -440,7 +440,7 @@ void bitmap_writer_select_commits(struct commit **indexed_commits,
 			}
 		}

-		push_bitmapped_commit(chosen, reused_bitmap);
+		bitmap_writer_push_commit(chosen, reused_bitmap);

 		i += next + 1;
 		display_progress(writer.progress, i);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 1203120c432..a882efdb16a 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -74,6 +74,8 @@ void bitmap_writer_build_type_index(struct packing_data *to_pack,
 				    struct pack_idx_entry **index,
 				    uint32_t index_nr);
 void bitmap_writer_reuse_bitmaps(struct packing_data *to_pack);
+void bitmap_writer_push_commit(struct commit *commit,
+			       struct ewah_bitmap *reused);
 void bitmap_writer_select_commits(struct commit **indexed_commits,
 		unsigned int indexed_commits_nr, int max_bitmaps);
 void bitmap_writer_build(struct packing_data *to_pack);
diff --git a/t/helper/test-bitmap.c b/t/helper/test-bitmap.c
new file mode 100644
index 00000000000..9be03377c06
--- /dev/null
+++ b/t/helper/test-bitmap.c
@@ -0,0 +1,119 @@
+#define USE_THE_REPOSITORY_VARIABLE
+
+#include "test-tool.h"
+#include "git-compat-util.h"
+#include "commit.h"
+#include "packfile.h"
+#include "pack-bitmap.h"
+
+static int add_packed_object(const struct object_id *oid,
+			     struct packed_git *pack,
+			     uint32_t pos,
+			     void *_data)
+{
+	struct packing_data *packed = _data;
+	struct object_entry *entry;
+	struct object_info oi = OBJECT_INFO_INIT;
+	enum object_type type;
+
+	oi.typep = &type;
+
+	entry = packlist_alloc(packed, oid);
+	entry->in_pack_offset = nth_packed_object_offset(pack, pos);
+	entry->idx.offset = entry->in_pack_offset;
+	if (packed_object_info(the_repository, pack, entry->in_pack_offset, &oi) < 0)
+		die("could not get type of object %s",
+		    oid_to_hex(oid));
+	oe_set_type(entry, type);
+	oe_set_in_pack(packed, entry, pack);
+
+	return 0;
+}
+
+static int idx_oid_cmp(const void *va, const void *vb)
+{
+	const struct pack_idx_entry *a = *(const struct pack_idx_entry **)va;
+	const struct pack_idx_entry *b = *(const struct pack_idx_entry **)vb;
+
+	return oidcmp(&a->oid, &b->oid);
+}
+
+static int bitmap_write(const char *basename)
+{
+	struct packed_git *p = NULL;
+	struct packing_data packed = { 0 };
+	struct pack_idx_entry **index;
+	struct strbuf buf = STRBUF_INIT;
+	uint32_t i;
+
+	prepare_repo_settings(the_repository);
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		if (!strcmp(pack_basename(p), basename))
+			break;
+	}
+
+	if (!p)
+		die("could not find pack '%s'", basename);
+
+	if (open_pack_index(p))
+		die("cannot open pack index for '%s'", p->pack_name);
+
+	prepare_packing_data(the_repository, &packed);
+
+	for_each_object_in_pack(p, add_packed_object, &packed,
+				FOR_EACH_OBJECT_PACK_ORDER);
+
+	/*
+	 * Build the index array now that data.packed.objects[] is
+	 * fully allocated (packlist_alloc() may have reallocated it
+	 * during the loop above).
+	 */
+	ALLOC_ARRAY(index, p->num_objects);
+	for (i = 0; i < p->num_objects; i++)
+		index[i] = &packed.objects[i].idx;
+
+	bitmap_writer_build_type_index(&packed, index, p->num_objects);
+
+	while (strbuf_getline_lf(&buf, stdin) != EOF) {
+		struct object_id oid;
+		struct commit *c;
+
+		if (get_oid_hex(buf.buf, &oid))
+			die("invalid OID: %s", buf.buf);
+
+		c = lookup_commit(the_repository, &oid);
+		if (!c || repo_parse_commit(the_repository, c))
+			die("could not parse commit %s", buf.buf);
+
+		bitmap_writer_push_commit(c, NULL);
+	}
+
+	bitmap_writer_build(&packed);
+
+	bitmap_writer_set_checksum(p->hash);
+
+	QSORT(index, p->num_objects, idx_oid_cmp);
+
+	strbuf_reset(&buf);
+	strbuf_addstr(&buf, p->pack_name);
+	strbuf_strip_suffix(&buf, ".pack");
+	strbuf_addstr(&buf, ".bitmap");
+	bitmap_writer_finish(index, p->num_objects, buf.buf, 0);
+
+	strbuf_release(&buf);
+	free(index);
+
+	return 0;
+}
+
+int cmd__bitmap(int argc, const char **argv)
+{
+	setup_git_directory();
+
+	if (argc == 3 && !strcmp(argv[1], "write"))
+		return bitmap_write(argv[2]);
+
+	usage("\ttest-tool bitmap write <pack-basename> < <commit-list>");
+
+	return -1;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 9d6d14d9293..c43d8c0977b 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -15,6 +15,7 @@ struct test_cmd {

 static struct test_cmd cmds[] = {
 	{ "advise", cmd__advise_if_enabled },
+	{ "bitmap", cmd__bitmap },
 	{ "bloom", cmd__bloom },
 	{ "chmtime", cmd__chmtime },
 	{ "config", cmd__config },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index a6470ff62c4..27e6e40ffcb 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -5,6 +5,7 @@
 #include "git-compat-util.h"

 int cmd__advise_if_enabled(int argc, const char **argv);
+int cmd__bitmap(int argc, const char **argv);
 int cmd__bloom(int argc, const char **argv);
 int cmd__chmtime(int argc, const char **argv);
 int cmd__config(int argc, const char **argv);
--- >8 ---

> > On our same testing repository from previous commits, this change shrunk
> > our selection of 1,261 bitmaps from ~635.46 MiB to 176.4 MiB for a
> > ~72.24% reduction in the on-disk size of our *.bitmap file. The time to
> > generate the smaller bitmap file decreased by ~3.69 seconds, though this
> > is likely mostly noise.
>
> Certainly good numbers. The obvious follow-up question is: how does the
> reading side fare? I'd expect it to be a little better, if only because
> there are fewer bytes to consider when XOR-ing. But if there's some
> hidden assumption we're missing, then it could get wildly worse. It
> would be good to confirm that that didn't happen. ;)

It doesn't make a huge difference. Prior to this patch, the timings on
my test repository for 'git rev-list --count --all --objects
--use-bitmap-index' go from:

 -  2.180 ± 0.019 seconds (with pseudo-merges)
 - 52.149 ± 0.224 seconds (without pseudo-merges)

, and after applying this patch, it changes to:

 -  2.611 ± 0.023 seconds (with pseudo-merges)
 - 51.963 ± 0.131 seconds (without pseudo-merges)

It looks like there is a minor slow-down on pseudo-merges, and a minor
speed-up without them. The difference is small enough that I'm willing
to treat it as run-to-run noise.

> >  static void compute_xor_offsets(struct bitmap_writer *writer)
> >  {
> >  	static const int MAX_XOR_OFFSET_SEARCH = 10;
> >
> >  	int i, next = 0;
> > +	int nr = bitmap_writer_nr_selected_commits(writer);
> > +
> > +	if (nr > 1) {
> > +		QSORT(writer->selected, nr, bitmapped_commit_date_cmp);
> > +
> > +		for (i = 0; i < nr; i++) {
> > +			struct bitmapped_commit *stored = &writer->selected[i];
> > +			khiter_t hash_pos = kh_get_oid_map(writer->bitmaps,
> > +							   stored->commit->object.oid);
> > +
> > +			if (hash_pos == kh_end(writer->bitmaps))
> > +				BUG("selected commit missing from bitmap map: %s",
> > +				    oid_to_hex(&stored->commit->object.oid));
> > +
> > +			kh_value(writer->bitmaps, hash_pos) = stored;
> > +		}
> > +	}
>
> OK. It took me a minute to wrap my head around this. The real work is
> done by QSORT(). But because we maintain a hash pointing into that
> array, we have to go through each hash entry and fix up its pointer.

Yup.

> Looks correct.

Thanks,
Taylor

[^1]: ...and I have great empathy for Stolee's suffering here when
  benchmarking his performance improvements to the bitmap generation
  code from back in the day! ;-).

  reply	other threads:[~2026-05-27 16:56 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-19 16:12 [PATCH 0/8] pack-bitmap-write: speed up bitmap generation Taylor Blau
2026-05-19 16:12 ` [PATCH 1/8] pack-bitmap: pass object position to `fill_bitmap_tree()` Taylor Blau
2026-05-27  8:57   ` Jeff King
2026-05-27 14:36     ` Taylor Blau
2026-05-19 16:12 ` [PATCH 2/8] pack-bitmap: check subtree bits before recursing Taylor Blau
2026-05-27  9:03   ` Jeff King
2026-05-27 14:36     ` Taylor Blau
2026-05-19 16:12 ` [PATCH 3/8] pack-bitmap: reuse stored selected bitmaps Taylor Blau
2026-05-27  9:24   ` Jeff King
2026-05-27 14:40     ` Taylor Blau
2026-05-19 16:12 ` [PATCH 4/8] pack-bitmap: consolidate `find_object_pos()` success path Taylor Blau
2026-05-20 14:42   ` SZEDER Gábor
2026-05-20 17:12     ` Taylor Blau
2026-05-27  9:27   ` Jeff King
2026-05-19 16:12 ` [PATCH 5/8] pack-bitmap: cache object positions during fill Taylor Blau
2026-05-27  9:45   ` Jeff King
2026-05-27 14:46     ` Taylor Blau
2026-05-19 16:12 ` [PATCH 6/8] pack-bitmap: sort bitmaps before XORing Taylor Blau
2026-05-27 10:04   ` Jeff King
2026-05-27 16:56     ` Taylor Blau [this message]
2026-05-19 16:12 ` [PATCH 7/8] pack-bitmap: remember pseudo-merge parents Taylor Blau
2026-05-19 16:12 ` [PATCH 8/8] pack-bitmap: build pseudo-merge bitmaps after regular bitmaps Taylor Blau
2026-05-27 10:25   ` Jeff King
2026-05-27 19:24     ` Taylor Blau
2026-05-27 10:27 ` [PATCH 0/8] pack-bitmap-write: speed up bitmap generation Jeff King
2026-05-27 19:55 ` [PATCH v2 " Taylor Blau
2026-05-27 19:55   ` [PATCH v2 1/8] pack-bitmap: pass object position to `fill_bitmap_tree()` Taylor Blau
2026-05-27 19:55   ` [PATCH v2 2/8] pack-bitmap: check subtree bits before recursing Taylor Blau
2026-05-27 19:55   ` [PATCH v2 3/8] pack-bitmap: reuse stored selected bitmaps Taylor Blau
2026-05-27 19:55   ` [PATCH v2 4/8] pack-bitmap: consolidate `find_object_pos()` success path Taylor Blau
2026-05-27 19:56   ` [PATCH v2 5/8] pack-bitmap: cache object positions during fill Taylor Blau
2026-05-27 19:56   ` [PATCH v2 6/8] pack-bitmap: sort bitmaps before XORing Taylor Blau
2026-05-27 19:56   ` [PATCH v2 7/8] pack-bitmap: remember pseudo-merge parents Taylor Blau
2026-05-27 19:56   ` [PATCH v2 8/8] pack-bitmap: build pseudo-merge bitmaps after regular bitmaps Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ahciIDuESxNa9Fzn@nand.local \
    --to=me@ttaylorr.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox