git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, Elijah Newren <newren@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Patrick Steinhardt <ps@pks.im>
Subject: Re: [PATCH 00/11] pack-bitmap: convert offset to ref deltas where possible
Date: Tue, 19 Nov 2024 18:08:24 -0500	[thread overview]
Message-ID: <Zz0aaPdHiFyoRkKg@nand.local> (raw)
In-Reply-To: <20241011083838.GI18010@coredump.intra.peff.net>

On Fri, Oct 11, 2024 at 04:38:38AM -0400, Jeff King wrote:
> On Wed, Oct 09, 2024 at 04:30:52PM -0400, Taylor Blau wrote:
>
> > The first optimization (cross-pack deltas) should help clones and
> > fetches, but the second optimization (thin deltas) will only help
> > fetches, since the 'haves' bitmap will necessarily be empty for
> > clones.
> >
> > Of course, REF_DELTAs have a less compact representation than
> > OFS_DELTAs, so the resulting packs will trade off some CPU time for a
> > slightly larger pack. But we're only talking about a handful of extra
> > bytes, and network bandwidth is pretty cheap, so I think the
> > optimization is worthwhile here too.
>
> It would be nice to see some numbers, even simulated ones from t/perf.
> Of course we'd like to show off any reduced server CPU for generating a
> fetch response. But I'd also like to see if the extra steps to find the
> cross-pack bases have any measurable negative effect (so the ideal there
> would be a big midx repo that doesn't have a lot of those bases, as it
> would not be helped much by the optimization and would be hurt by the
> time spent trying to apply it).

I put together some loose numbers for this, and there is a definite
measurable slowdown imposed by this series.

The setup is getting a fresh copy of the kernel, and then repacking it
with:

    $ git repack -da --cruft --write-midx --write-bitmap-index \
        --max-pack-size 1G --max-cruft-size 1G

, which on my machine produces the following pack structure:

    $ for idx in .git/objects/pack/*.idx
      do
        echo $(basename $idx) \
          $(git show-index <$idx | wc -l)
      done | sort -rnk2
    pack-8dc61c00e623765c54cbd332f31795bad7edf142.idx 3385859
    pack-d84e88715cde55d4d1b09afb23254617a123e668.idx 2064396
    pack-cd408bebd5c02719df7621941d92415f4dbb313c.idx 1805431
    pack-b11c6ac57f5ecf338314ee59134b0d724d257494.idx 1036443
    pack-a42cc1b409940d7e38a4673bca1150d320392bf0.idx 819701
    pack-93d255a1000f438a77528907c480c06b277e225d.idx 755019
    pack-600621b90dc29c761ee168cc4237b2ff2956b0e8.idx 609754

Generating a full clone[^1] with multi-pack reuse enabled (based on the tip
of tb/multi-pack-reuse-dupfix) gives me the following timings:

    13.94s user 0.27s system 99% cpu 14.207 total
    13.93s user 0.30s system 99% cpu 14.223 total
    13.90s user 0.33s system 99% cpu 14.229 total

After applying the series, I get:

    14.94s user 0.33s system 99% cpu 15.279 total
    14.92s user 0.32s system 99% cpu 15.244 total
    14.87s user 0.40s system 99% cpu 15.280 total

Or around a ~5% slowdown. I think what's really killing us is all of the
back and forth in pack-bitmap.c::find_base_bitmap_pos(), where we have
to:

 - Find the pack-relative position of the base object given its offset.
 - Convert that pack-relative position into a lexical position (still
   relative to the source pack).
 - Convert that lexical position into an object ID.
 - Call bsearch_midx() to see if we have a copy of that object, and
   record its MIDX-relative lexical position.
 - Convert the MIDX-relative lexical position into its pseudo-pack
   position.

I think that the first four of those steps are unavoidable. But I think
we can do better on the last step if we compute a forward index from
MIDX lexical position to pseudo-pack position.

A cheap way to run that experiment is to just add a temporary field into
the MIDX to store that mapping, compute it at runtime, and then deduct
the time it took to compute that mapping (presumably you'd do it during
MIDX creation, and store it in a chunk, etc.).

Here's some code to do that:

--- 8< ---
diff --git a/midx.c b/midx.c
index ca98bfd7c64..2bd892f899c 100644
--- a/midx.c
+++ b/midx.c
@@ -988,3 +988,18 @@ int verify_midx_file(struct repository *r, const char *object_dir, unsigned flag

 	return verify_midx_error;
 }
+
+void midx_populate_forward_index(struct multi_pack_index *m)
+{
+	uint32_t i;
+
+	ALLOC_ARRAY(m->forward_idx, m->num_objects);
+
+	trace2_region_enter("midx", "populate_forward_index", the_repository);
+
+	for (i = 0; i < m->num_objects; i++)
+		if (midx_to_pack_pos(m, i, m->forward_idx + i) < 0)
+			BUG("oops");
+
+	trace2_region_leave("midx", "populate_forward_index", the_repository);
+}
diff --git a/midx.h b/midx.h
index 42d4f8d149e..1ef755fe45b 100644
--- a/midx.h
+++ b/midx.h
@@ -65,6 +65,8 @@ struct multi_pack_index {
 	const unsigned char *chunk_revindex;
 	size_t chunk_revindex_len;

+	uint32_t *forward_idx;
+
 	struct multi_pack_index *base_midx;
 	uint32_t num_objects_in_base;
 	uint32_t num_packs_in_base;
@@ -74,6 +76,8 @@ struct multi_pack_index {
 	char object_dir[FLEX_ARRAY];
 };

+void midx_populate_forward_index(struct multi_pack_index *m);
+
 #define MIDX_PROGRESS     (1 << 0)
 #define MIDX_WRITE_REV_INDEX (1 << 1)
 #define MIDX_WRITE_BITMAP (1 << 2)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index e8094453ca3..1eb72347c02 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -2092,9 +2092,13 @@ static int find_base_bitmap_pos(struct bitmap_index *bitmap_git,
 		if (!bsearch_midx(&base_oid, bitmap_git->midx,
 				  &base_midx_pos))
 			return -1;
+#if 0
 		if (midx_to_pack_pos(bitmap_git->midx, base_midx_pos,
 				     base_bitmap_pos) < 0)
 			return -1;
+#else
+		*base_bitmap_pos = bitmap_git->midx->forward_idx[base_midx_pos];
+#endif
 	} else {
 		/*
 		 * We assume delta dependencies always point backwards.
@@ -2316,6 +2320,7 @@ void reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	assert(result);

 	load_reverse_index(r, bitmap_git);
+	midx_populate_forward_index(bitmap_git->midx);

 	if (!bitmap_is_midx(bitmap_git) || !bitmap_git->midx->chunk_bitmapped_packs)
 		multi_pack_reuse = 0;
--- >8 ---

Then when running the same command, we get results that are quite
encouraging. The runtime jumps to 24.213 seconds, which is ~9.73 seconds
slower than the average of the baseline measurements. But it takes
~10.418 seconds on my machine to compute the forward index. So it's
really around 688ms *faster* than the baseline, despite doing a little
more work.

The fact that we can improve on the baseline rather than just meet it
suggests that implementing this forward index may have some benefit
outside of just this series.

Thanks,
Taylor

[^1]: The setup here is:

    $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
    $ git pack-objects --stdout --delta-base-offset --use-bitmap-index --revs <in >/dev/null

  reply	other threads:[~2024-11-19 23:08 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-09 20:30 [PATCH 00/11] pack-bitmap: convert offset to ref deltas where possible Taylor Blau
2024-10-09 20:31 ` [PATCH 01/11] pack-bitmap.c: do not pass `pack_pos` to `try_partial_reuse()` Taylor Blau
2024-10-09 20:31 ` [PATCH 02/11] pack-bitmap.c: avoid unnecessary `offset_to_pack_pos()` Taylor Blau
2024-10-09 20:31 ` [PATCH 03/11] pack-bitmap.c: delay calling 'offset_to_pack_pos()' Taylor Blau
2024-10-09 20:31 ` [PATCH 04/11] pack-bitmap.c: compare `base_offset` to `delta_obj_offset` Taylor Blau
2024-10-09 20:31 ` [PATCH 05/11] pack-bitmap.c: extract `find_base_bitmap_pos()` Taylor Blau
2024-10-09 20:31 ` [PATCH 06/11] pack-bitmap: drop `from_midx` field from `bitmapped_pack` Taylor Blau
2024-10-09 20:31 ` [PATCH 07/11] write_reused_pack_one(): translate bit positions directly Taylor Blau
2024-10-11  8:16   ` Jeff King
2024-11-04 20:36     ` Taylor Blau
2024-10-09 20:31 ` [PATCH 08/11] t5332: enable OFS_DELTAs via test_pack_objects_reused Taylor Blau
2024-10-11  8:19   ` Jeff King
2024-11-04 20:50     ` Taylor Blau
2024-10-09 20:31 ` [PATCH 09/11] pack-bitmap: enable cross-pack delta reuse Taylor Blau
2024-10-11  8:31   ` Jeff King
2024-11-04 21:00     ` Taylor Blau
2024-10-09 20:31 ` [PATCH 10/11] pack-bitmap.c: record whether the result was filtered Taylor Blau
2024-10-11  8:35   ` Jeff King
2024-11-04 21:01     ` Taylor Blau
2024-10-09 20:31 ` [PATCH 11/11] pack-bitmap: enable reusing deltas with base objects in 'haves' bitmap Taylor Blau
2024-10-10 16:46 ` [PATCH 00/11] pack-bitmap: convert offset to ref deltas where possible Junio C Hamano
2024-10-10 17:07   ` Taylor Blau
2024-10-10 20:20     ` Junio C Hamano
2024-10-10 20:32       ` Taylor Blau
2024-10-11  7:54       ` Jeff King
2024-10-11  8:01         ` Jeff King
2024-10-11 16:23           ` Junio C Hamano
2024-10-11  8:38 ` Jeff King
2024-11-19 23:08   ` Taylor Blau [this message]
2024-11-19 23:34     ` Taylor Blau
2024-12-18 12:57       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zz0aaPdHiFyoRkKg@nand.local \
    --to=me@ttaylorr.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).