From: Patrick Steinhardt <ps@pks.im>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, Jeff King <peff@peff.net>,
Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
Date: Thu, 30 Nov 2023 11:18:19 +0100 [thread overview]
Message-ID: <ZWhha2h2zzZWCXrw@tanuki> (raw)
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>
[-- Attachment #1: Type: text/plain, Size: 10142 bytes --]
On Tue, Nov 28, 2023 at 02:07:54PM -0500, Taylor Blau wrote:
> Back in fff42755ef (pack-bitmap: add support for bitmap indexes,
> 2013-12-21), we added support for reachability bitmaps, and taught
> pack-objects how to reuse verbatim chunks from the bitmapped pack. When
> multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to
> use the MIDX's "preferred" pack as the source for verbatim reuse.
>
> This allows repositories to incrementally repack themselves (e.g., using
> a `--geometric` repack), storing the result in a MIDX, and generating a
> corresponding bitmap. This keeps our bitmap coverage up-to-date, while
> maintaining a relatively small number of packs.
>
> However, it is recommended (and matches what we do in production at
> GitHub) that repositories repack themselves all-into-one, and
> generate a corresponding single-pack reachability bitmap. This is done
> for a couple of reasons, but the most relevant one to this series is
> that it enables us to perform verbatim pack-reuse over a complete copy
> of the repository, since the entire repository resides in a single pack
> (and thus is eligible for verbatim pack-reuse).
>
> As repositories grow larger, packing their contents into a single pack
> becomes less feasible. This series extends the pack-reuse mechanism to
> operate over multiple packs which are known ahead of time to be disjoint
> with respect to one another's set of objects.
>
> The implementation has a few components:
>
> - A new MIDX chunk called "Disjoint packfiles" or DISP is introduced
> to keep track of the bitmap position, number of objects, and
> disjointed-ness for each pack contained in the MIDX.
>
> - A new mode for `git multi-pack-index write --stdin-packs` that
> allows specifying disjoint packs, as well as a new option
> `--retain-disjoint` which preserves the set of existing disjoint
> packs in the new MIDX.
>
> - A new pack-objects mode `--ignore-disjoint`, which produces packs
> which are disjoint with respect to the current set of disjoint packs
> (i.e. it discards any objects from the packing list which appear in
> any of the known-disjoint packs).
>
> - A new repack mode, `--extend-disjoint` which causes any new pack(s)
> which are generated to be disjoint with respect to the set of packs
> currently marked as disjoint, minus any pack(s) which are about to
> be deleted.
>
> With all of that in place, the patch series then rewrites all of the
> pack-reuse functions in terms of the new `bitmapped_pack` structure.
> Once we have dropped all of the assumptions stemming from only
> performing pack-reuse over a single candidate pack, we can then enable
> reuse over all of the disjoint packs.
>
> In addition to the many new tests in t5332 added by that series, I tried
> to simulate a "real world" test on git.git by breaking the repository
> into chunks of 1,000 commits (plus their set of reachable objects not
> reachable from earlier chunk(s)) and packing those chunks. This produces
> a large number of packs with the objects from git.git which are known to
> be disjoint with respect to one another.
>
> $ git clone git@github.com:git/git.git base
>
> $ cd base
> $ mv .git/objects/pack/pack-*.idx{,.bak}
> $ git unpack-objects <.git/objects/pack/pack-*.pack
>
> # pack the objects from each successive block of 1k commits
> $ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac)
> do
> echo $rev |
> git.compile pack-objects --revs --unpacked .git/objects/pack/pack || return 1
> done
> # and grab any stragglers, pruning the unpacked objects
> $ git repack -d
> I then constructed a MIDX and corresponding bitmap
>
> $ find_pack () {
> for idx in .git/objects/pack/pack-*.idx
> do
> git show-index <$idx | grep -q "$1" && basename $idx
> done
> }
> $ preferred="$(find_pack $(git rev-parse HEAD))"
>
> $ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' |
> git.compile multi-pack-index write --bitmap --stdin-packs \
> --preferred-pack=$preferred
> $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
>
> With all of that in place, I was able to produce a significant speed-up
> by reusing objects from multiple packs:
>
> $ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
> Benchmark 1: single-pack reuse
> Time (mean ± σ): 6.094 s ± 0.023 s [User: 43.723 s, System: 0.358 s]
> Range (min … max): 6.063 s … 6.126 s 10 runs
>
> Benchmark 2: multi-pack reuse
> Time (mean ± σ): 906.5 ms ± 3.2 ms [User: 1081.5 ms, System: 30.9 ms]
> Range (min … max): 903.5 ms … 912.7 ms 10 runs
>
> Summary
> multi-pack reuse ran
> 6.72 ± 0.03 times faster than single-pack reuse
>
> (There are corresponding tests in p5332 that test different sized chunks
> and measure the runtime performance as well as resulting pack size).
>
> Performing verbatim pack reuse naturally trades off between CPU time and
> the resulting pack size. In the above example, the single-pack reuse
> case produces a clone size of ~194 MB on my machine, while the
> multi-pack reuse case produces a clone size closer to ~266 MB, which is
> a ~37% increase in clone size.
Quite exciting, and a tradeoff that may be worth it for Git hosters. I
expect that this is going to be an extreme example of the benefits
provided by your patch series -- do you by any chance also have "real"
numbers that make it possible to quantify the effect a bit better?
No worry if you don't, I'm just curious.
> I think there is still some opportunity to close this gap, since the
> "packing" strategy here is extremely naive. In a production setting, I'm
> sure that there are more well thought out repacking strategies that
> would produce more similar clone sizes.
>
> I considered breaking this series up into smaller chunks, but was
> unsatisfied with the result. Since this series is rather large, if you
> have alternate suggestions on better ways to structure this, please let
> me know.
The series is indeed very involved to review. I only made it up to patch
8/24 and already spent quite some time on it. So I'd certainly welcome
it if this was split up into smaller parts, but don't have a suggestion
as to how this should be done (also because I didn't yet read the other
16 patches).
I'll review the remaining patches at a later point in time.
Patrick
> Thanks in advance for your review!
>
> Taylor Blau (24):
> pack-objects: free packing_data in more places
> pack-bitmap-write: deep-clear the `bb_commit` slab
> pack-bitmap: plug leak in find_objects()
> midx: factor out `fill_pack_info()`
> midx: implement `DISP` chunk
> midx: implement `midx_locate_pack()`
> midx: implement `--retain-disjoint` mode
> pack-objects: implement `--ignore-disjoint` mode
> repack: implement `--extend-disjoint` mode
> pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
> pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
> pack-bitmap: return multiple packs via
> `reuse_partial_packfile_from_bitmap()`
> pack-objects: parameterize pack-reuse routines over a single pack
> pack-objects: keep track of `pack_start` for each reuse pack
> pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
> pack-objects: prepare `write_reused_pack()` for multi-pack reuse
> pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
> reuse
> pack-objects: include number of packs reused in output
> pack-bitmap: prepare to mark objects from multiple packs for reuse
> pack-objects: add tracing for various packfile metrics
> t/test-lib-functions.sh: implement `test_trace2_data` helper
> pack-objects: allow setting `pack.allowPackReuse` to "single"
> pack-bitmap: reuse objects from all disjoint packs
> t/perf: add performance tests for multi-pack reuse
>
> Documentation/config/pack.txt | 8 +-
> Documentation/git-multi-pack-index.txt | 12 ++
> Documentation/git-pack-objects.txt | 8 +
> Documentation/git-repack.txt | 12 ++
> Documentation/gitformat-pack.txt | 109 ++++++++++
> builtin/multi-pack-index.c | 13 +-
> builtin/pack-objects.c | 200 +++++++++++++++----
> builtin/repack.c | 57 +++++-
> midx.c | 218 +++++++++++++++++---
> midx.h | 11 +-
> pack-bitmap-write.c | 9 +-
> pack-bitmap.c | 265 ++++++++++++++++++++-----
> pack-bitmap.h | 18 +-
> pack-objects.c | 15 ++
> pack-objects.h | 1 +
> t/helper/test-read-midx.c | 31 ++-
> t/lib-disjoint.sh | 49 +++++
> t/perf/p5332-multi-pack-reuse.sh | 81 ++++++++
> t/t5319-multi-pack-index.sh | 140 +++++++++++++
> t/t5331-pack-objects-stdin.sh | 156 +++++++++++++++
> t/t5332-multi-pack-reuse.sh | 219 ++++++++++++++++++++
> t/t6113-rev-list-bitmap-filters.sh | 2 +
> t/t7700-repack.sh | 4 +-
> t/t7705-repack-extend-disjoint.sh | 142 +++++++++++++
> t/test-lib-functions.sh | 14 ++
> 25 files changed, 1650 insertions(+), 144 deletions(-)
> create mode 100644 t/lib-disjoint.sh
> create mode 100755 t/perf/p5332-multi-pack-reuse.sh
> create mode 100755 t/t5332-multi-pack-reuse.sh
> create mode 100755 t/t7705-repack-extend-disjoint.sh
>
>
> base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d
> --
> 2.43.0.24.g980b318f98
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2023-11-30 10:18 UTC|newest]
Thread overview: 107+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
2023-11-28 19:07 ` [PATCH 01/24] pack-objects: free packing_data in more places Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:08 ` Taylor Blau
2023-11-28 19:07 ` [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:11 ` Taylor Blau
2023-12-12 7:04 ` Jeff King
2023-12-12 16:48 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 03/24] pack-bitmap: plug leak in find_objects() Taylor Blau
2023-12-12 7:04 ` Jeff King
2023-11-28 19:08 ` [PATCH 04/24] midx: factor out `fill_pack_info()` Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:19 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 05/24] midx: implement `DISP` chunk Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:27 ` Taylor Blau
2023-12-03 13:15 ` Junio C Hamano
2023-12-05 19:26 ` Taylor Blau
2023-12-09 1:40 ` Junio C Hamano
2023-12-09 2:30 ` Taylor Blau
2023-12-12 8:03 ` Jeff King
2023-12-13 18:28 ` Taylor Blau
2023-12-13 19:20 ` Junio C Hamano
2023-11-28 19:08 ` [PATCH 06/24] midx: implement `midx_locate_pack()` Taylor Blau
2023-11-28 19:08 ` [PATCH 07/24] midx: implement `--retain-disjoint` mode Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:29 ` Taylor Blau
2023-12-01 8:02 ` Patrick Steinhardt
2023-11-28 19:08 ` [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:32 ` Taylor Blau
2023-12-01 8:17 ` Patrick Steinhardt
2023-12-01 19:58 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 09/24] repack: implement `--extend-disjoint` mode Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:28 ` Taylor Blau
2023-12-08 8:19 ` Patrick Steinhardt
2023-12-08 22:48 ` Taylor Blau
2023-12-11 8:18 ` Patrick Steinhardt
2023-12-11 19:59 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:34 ` Taylor Blau
2023-12-08 8:19 ` Patrick Steinhardt
2023-11-28 19:08 ` [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 14:36 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 12/24] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
2023-11-28 19:08 ` [PATCH 13/24] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
2023-11-28 19:08 ` [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:43 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 15/24] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
2023-11-28 19:08 ` [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:47 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 17/24] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
2023-11-28 19:08 ` [PATCH 18/24] pack-objects: include number of packs reused in output Taylor Blau
2023-11-28 19:08 ` [PATCH 19/24] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
2023-11-28 19:08 ` [PATCH 20/24] pack-objects: add tracing for various packfile metrics Taylor Blau
2023-11-28 19:08 ` [PATCH 21/24] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
2023-11-28 19:08 ` [PATCH 22/24] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
2023-11-28 19:08 ` [PATCH 23/24] pack-bitmap: reuse objects from all disjoint packs Taylor Blau
2023-11-28 19:08 ` [PATCH 24/24] t/perf: add performance tests for multi-pack reuse Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt [this message]
2023-11-30 19:39 ` [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
2023-12-01 8:31 ` Patrick Steinhardt
2023-12-01 20:02 ` Taylor Blau
2023-12-04 8:49 ` Patrick Steinhardt
2023-12-12 8:12 ` Jeff King
2023-12-15 15:37 ` Taylor Blau
2023-12-21 11:13 ` Jeff King
2024-01-04 22:22 ` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
2023-12-14 22:23 ` [PATCH v2 01/26] pack-objects: free packing_data in more places Taylor Blau
2023-12-14 22:23 ` [PATCH v2 02/26] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
2023-12-14 22:23 ` [PATCH v2 03/26] pack-bitmap: plug leak in find_objects() Taylor Blau
2023-12-14 22:23 ` [PATCH v2 04/26] midx: factor out `fill_pack_info()` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 05/26] midx: implement `BTMP` chunk Taylor Blau
2023-12-14 22:23 ` [PATCH v2 06/26] midx: implement `midx_locate_pack()` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 07/26] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
2023-12-14 22:23 ` [PATCH v2 08/26] ewah: implement `bitmap_is_empty()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 09/26] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
2023-12-14 22:24 ` [PATCH v2 10/26] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 11/26] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
2023-12-14 22:24 ` [PATCH v2 12/26] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
2023-12-14 22:24 ` [PATCH v2 13/26] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
2023-12-14 22:24 ` [PATCH v2 14/26] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
2023-12-14 22:24 ` [PATCH v2 15/26] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
2023-12-14 22:24 ` [PATCH v2 16/26] pack-objects: include number of packs reused in output Taylor Blau
2023-12-14 22:24 ` [PATCH v2 17/26] git-compat-util.h: implement checked size_t to uint32_t conversion Taylor Blau
2023-12-14 22:24 ` [PATCH v2 18/26] midx: implement `midx_preferred_pack()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 19/26] pack-revindex: factor out `midx_key_to_pack_pos()` helper Taylor Blau
2023-12-14 22:24 ` [PATCH v2 20/26] pack-revindex: implement `midx_pair_to_pack_pos()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 21/26] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
2023-12-14 22:24 ` [PATCH v2 22/26] pack-objects: add tracing for various packfile metrics Taylor Blau
2023-12-14 22:24 ` [PATCH v2 23/26] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
2023-12-14 22:24 ` [PATCH v2 24/26] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
2023-12-14 22:24 ` [PATCH v2 25/26] pack-bitmap: enable reuse from all bitmapped packs Taylor Blau
2023-12-14 22:24 ` [PATCH v2 26/26] t/perf: add performance tests for multi-pack reuse Taylor Blau
2023-12-15 0:06 ` [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse Junio C Hamano
2023-12-15 0:38 ` Taylor Blau
2023-12-15 0:40 ` Junio C Hamano
2023-12-15 1:25 ` Taylor Blau
2023-12-21 10:51 ` Jeff King
2024-01-04 22:24 ` Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZWhha2h2zzZWCXrw@tanuki \
--to=ps@pks.im \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).