From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: Jeff King <peff@peff.net>, Patrick Steinhardt <ps@pks.im>,
Junio C Hamano <gitster@pobox.com>
Subject: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
Date: Tue, 28 Nov 2023 14:07:54 -0500 [thread overview]
Message-ID: <cover.1701198172.git.me@ttaylorr.com> (raw)
Back in fff42755ef (pack-bitmap: add support for bitmap indexes,
2013-12-21), we added support for reachability bitmaps, and taught
pack-objects how to reuse verbatim chunks from the bitmapped pack. When
multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to
use the MIDX's "preferred" pack as the source for verbatim reuse.
This allows repositories to incrementally repack themselves (e.g., using
a `--geometric` repack), storing the result in a MIDX, and generating a
corresponding bitmap. This keeps our bitmap coverage up-to-date, while
maintaining a relatively small number of packs.
However, it is recommended (and matches what we do in production at
GitHub) that repositories repack themselves all-into-one, and
generate a corresponding single-pack reachability bitmap. This is done
for a couple of reasons, but the most relevant one to this series is
that it enables us to perform verbatim pack-reuse over a complete copy
of the repository, since the entire repository resides in a single pack
(and thus is eligible for verbatim pack-reuse).
As repositories grow larger, packing their contents into a single pack
becomes less feasible. This series extends the pack-reuse mechanism to
operate over multiple packs which are known ahead of time to be disjoint
with respect to one another's set of objects.
The implementation has a few components:
- A new MIDX chunk called "Disjoint packfiles" or DISP is introduced
to keep track of the bitmap position, number of objects, and
disjointed-ness for each pack contained in the MIDX.
- A new mode for `git multi-pack-index write --stdin-packs` that
allows specifying disjoint packs, as well as a new option
`--retain-disjoint` which preserves the set of existing disjoint
packs in the new MIDX.
- A new pack-objects mode `--ignore-disjoint`, which produces packs
which are disjoint with respect to the current set of disjoint packs
(i.e. it discards any objects from the packing list which appear in
any of the known-disjoint packs).
- A new repack mode, `--extend-disjoint` which causes any new pack(s)
which are generated to be disjoint with respect to the set of packs
currently marked as disjoint, minus any pack(s) which are about to
be deleted.
With all of that in place, the patch series then rewrites all of the
pack-reuse functions in terms of the new `bitmapped_pack` structure.
Once we have dropped all of the assumptions stemming from only
performing pack-reuse over a single candidate pack, we can then enable
reuse over all of the disjoint packs.
In addition to the many new tests in t5332 added by that series, I tried
to simulate a "real world" test on git.git by breaking the repository
into chunks of 1,000 commits (plus their set of reachable objects not
reachable from earlier chunk(s)) and packing those chunks. This produces
a large number of packs with the objects from git.git which are known to
be disjoint with respect to one another.
$ git clone git@github.com:git/git.git base
$ cd base
$ mv .git/objects/pack/pack-*.idx{,.bak}
$ git unpack-objects <.git/objects/pack/pack-*.pack
# pack the objects from each successive block of 1k commits
$ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac)
do
echo $rev |
git.compile pack-objects --revs --unpacked .git/objects/pack/pack || return 1
done
# and grab any stragglers, pruning the unpacked objects
$ git repack -d
I then constructed a MIDX and corresponding bitmap
$ find_pack () {
for idx in .git/objects/pack/pack-*.idx
do
git show-index <$idx | grep -q "$1" && basename $idx
done
}
$ preferred="$(find_pack $(git rev-parse HEAD))"
$ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' |
git.compile multi-pack-index write --bitmap --stdin-packs \
--preferred-pack=$preferred
$ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
With all of that in place, I was able to produce a significant speed-up
by reusing objects from multiple packs:
$ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
Benchmark 1: single-pack reuse
Time (mean ± σ): 6.094 s ± 0.023 s [User: 43.723 s, System: 0.358 s]
Range (min … max): 6.063 s … 6.126 s 10 runs
Benchmark 2: multi-pack reuse
Time (mean ± σ): 906.5 ms ± 3.2 ms [User: 1081.5 ms, System: 30.9 ms]
Range (min … max): 903.5 ms … 912.7 ms 10 runs
Summary
multi-pack reuse ran
6.72 ± 0.03 times faster than single-pack reuse
(There are corresponding tests in p5332 that test different sized chunks
and measure the runtime performance as well as resulting pack size).
Performing verbatim pack reuse naturally trades off between CPU time and
the resulting pack size. In the above example, the single-pack reuse
case produces a clone size of ~194 MB on my machine, while the
multi-pack reuse case produces a clone size closer to ~266 MB, which is
a ~37% increase in clone size.
I think there is still some opportunity to close this gap, since the
"packing" strategy here is extremely naive. In a production setting, I'm
sure that there are more well thought out repacking strategies that
would produce more similar clone sizes.
I considered breaking this series up into smaller chunks, but was
unsatisfied with the result. Since this series is rather large, if you
have alternate suggestions on better ways to structure this, please let
me know.
Thanks in advance for your review!
Taylor Blau (24):
pack-objects: free packing_data in more places
pack-bitmap-write: deep-clear the `bb_commit` slab
pack-bitmap: plug leak in find_objects()
midx: factor out `fill_pack_info()`
midx: implement `DISP` chunk
midx: implement `midx_locate_pack()`
midx: implement `--retain-disjoint` mode
pack-objects: implement `--ignore-disjoint` mode
repack: implement `--extend-disjoint` mode
pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
pack-bitmap: return multiple packs via
`reuse_partial_packfile_from_bitmap()`
pack-objects: parameterize pack-reuse routines over a single pack
pack-objects: keep track of `pack_start` for each reuse pack
pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
pack-objects: prepare `write_reused_pack()` for multi-pack reuse
pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
reuse
pack-objects: include number of packs reused in output
pack-bitmap: prepare to mark objects from multiple packs for reuse
pack-objects: add tracing for various packfile metrics
t/test-lib-functions.sh: implement `test_trace2_data` helper
pack-objects: allow setting `pack.allowPackReuse` to "single"
pack-bitmap: reuse objects from all disjoint packs
t/perf: add performance tests for multi-pack reuse
Documentation/config/pack.txt | 8 +-
Documentation/git-multi-pack-index.txt | 12 ++
Documentation/git-pack-objects.txt | 8 +
Documentation/git-repack.txt | 12 ++
Documentation/gitformat-pack.txt | 109 ++++++++++
builtin/multi-pack-index.c | 13 +-
builtin/pack-objects.c | 200 +++++++++++++++----
builtin/repack.c | 57 +++++-
midx.c | 218 +++++++++++++++++---
midx.h | 11 +-
pack-bitmap-write.c | 9 +-
pack-bitmap.c | 265 ++++++++++++++++++++-----
pack-bitmap.h | 18 +-
pack-objects.c | 15 ++
pack-objects.h | 1 +
t/helper/test-read-midx.c | 31 ++-
t/lib-disjoint.sh | 49 +++++
t/perf/p5332-multi-pack-reuse.sh | 81 ++++++++
t/t5319-multi-pack-index.sh | 140 +++++++++++++
t/t5331-pack-objects-stdin.sh | 156 +++++++++++++++
t/t5332-multi-pack-reuse.sh | 219 ++++++++++++++++++++
t/t6113-rev-list-bitmap-filters.sh | 2 +
t/t7700-repack.sh | 4 +-
t/t7705-repack-extend-disjoint.sh | 142 +++++++++++++
t/test-lib-functions.sh | 14 ++
25 files changed, 1650 insertions(+), 144 deletions(-)
create mode 100644 t/lib-disjoint.sh
create mode 100755 t/perf/p5332-multi-pack-reuse.sh
create mode 100755 t/t5332-multi-pack-reuse.sh
create mode 100755 t/t7705-repack-extend-disjoint.sh
base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d
--
2.43.0.24.g980b318f98
next reply other threads:[~2023-11-28 19:07 UTC|newest]
Thread overview: 107+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-28 19:07 Taylor Blau [this message]
2023-11-28 19:07 ` [PATCH 01/24] pack-objects: free packing_data in more places Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:08 ` Taylor Blau
2023-11-28 19:07 ` [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:11 ` Taylor Blau
2023-12-12 7:04 ` Jeff King
2023-12-12 16:48 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 03/24] pack-bitmap: plug leak in find_objects() Taylor Blau
2023-12-12 7:04 ` Jeff King
2023-11-28 19:08 ` [PATCH 04/24] midx: factor out `fill_pack_info()` Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:19 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 05/24] midx: implement `DISP` chunk Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:27 ` Taylor Blau
2023-12-03 13:15 ` Junio C Hamano
2023-12-05 19:26 ` Taylor Blau
2023-12-09 1:40 ` Junio C Hamano
2023-12-09 2:30 ` Taylor Blau
2023-12-12 8:03 ` Jeff King
2023-12-13 18:28 ` Taylor Blau
2023-12-13 19:20 ` Junio C Hamano
2023-11-28 19:08 ` [PATCH 06/24] midx: implement `midx_locate_pack()` Taylor Blau
2023-11-28 19:08 ` [PATCH 07/24] midx: implement `--retain-disjoint` mode Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:29 ` Taylor Blau
2023-12-01 8:02 ` Patrick Steinhardt
2023-11-28 19:08 ` [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:32 ` Taylor Blau
2023-12-01 8:17 ` Patrick Steinhardt
2023-12-01 19:58 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 09/24] repack: implement `--extend-disjoint` mode Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:28 ` Taylor Blau
2023-12-08 8:19 ` Patrick Steinhardt
2023-12-08 22:48 ` Taylor Blau
2023-12-11 8:18 ` Patrick Steinhardt
2023-12-11 19:59 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:34 ` Taylor Blau
2023-12-08 8:19 ` Patrick Steinhardt
2023-11-28 19:08 ` [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 14:36 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 12/24] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
2023-11-28 19:08 ` [PATCH 13/24] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
2023-11-28 19:08 ` [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:43 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 15/24] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
2023-11-28 19:08 ` [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:47 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 17/24] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
2023-11-28 19:08 ` [PATCH 18/24] pack-objects: include number of packs reused in output Taylor Blau
2023-11-28 19:08 ` [PATCH 19/24] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
2023-11-28 19:08 ` [PATCH 20/24] pack-objects: add tracing for various packfile metrics Taylor Blau
2023-11-28 19:08 ` [PATCH 21/24] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
2023-11-28 19:08 ` [PATCH 22/24] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
2023-11-28 19:08 ` [PATCH 23/24] pack-bitmap: reuse objects from all disjoint packs Taylor Blau
2023-11-28 19:08 ` [PATCH 24/24] t/perf: add performance tests for multi-pack reuse Taylor Blau
2023-11-30 10:18 ` [PATCH 00/24] pack-objects: multi-pack verbatim reuse Patrick Steinhardt
2023-11-30 19:39 ` Taylor Blau
2023-12-01 8:31 ` Patrick Steinhardt
2023-12-01 20:02 ` Taylor Blau
2023-12-04 8:49 ` Patrick Steinhardt
2023-12-12 8:12 ` Jeff King
2023-12-15 15:37 ` Taylor Blau
2023-12-21 11:13 ` Jeff King
2024-01-04 22:22 ` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
2023-12-14 22:23 ` [PATCH v2 01/26] pack-objects: free packing_data in more places Taylor Blau
2023-12-14 22:23 ` [PATCH v2 02/26] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
2023-12-14 22:23 ` [PATCH v2 03/26] pack-bitmap: plug leak in find_objects() Taylor Blau
2023-12-14 22:23 ` [PATCH v2 04/26] midx: factor out `fill_pack_info()` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 05/26] midx: implement `BTMP` chunk Taylor Blau
2023-12-14 22:23 ` [PATCH v2 06/26] midx: implement `midx_locate_pack()` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 07/26] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
2023-12-14 22:23 ` [PATCH v2 08/26] ewah: implement `bitmap_is_empty()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 09/26] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
2023-12-14 22:24 ` [PATCH v2 10/26] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 11/26] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
2023-12-14 22:24 ` [PATCH v2 12/26] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
2023-12-14 22:24 ` [PATCH v2 13/26] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
2023-12-14 22:24 ` [PATCH v2 14/26] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
2023-12-14 22:24 ` [PATCH v2 15/26] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
2023-12-14 22:24 ` [PATCH v2 16/26] pack-objects: include number of packs reused in output Taylor Blau
2023-12-14 22:24 ` [PATCH v2 17/26] git-compat-util.h: implement checked size_t to uint32_t conversion Taylor Blau
2023-12-14 22:24 ` [PATCH v2 18/26] midx: implement `midx_preferred_pack()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 19/26] pack-revindex: factor out `midx_key_to_pack_pos()` helper Taylor Blau
2023-12-14 22:24 ` [PATCH v2 20/26] pack-revindex: implement `midx_pair_to_pack_pos()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 21/26] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
2023-12-14 22:24 ` [PATCH v2 22/26] pack-objects: add tracing for various packfile metrics Taylor Blau
2023-12-14 22:24 ` [PATCH v2 23/26] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
2023-12-14 22:24 ` [PATCH v2 24/26] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
2023-12-14 22:24 ` [PATCH v2 25/26] pack-bitmap: enable reuse from all bitmapped packs Taylor Blau
2023-12-14 22:24 ` [PATCH v2 26/26] t/perf: add performance tests for multi-pack reuse Taylor Blau
2023-12-15 0:06 ` [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse Junio C Hamano
2023-12-15 0:38 ` Taylor Blau
2023-12-15 0:40 ` Junio C Hamano
2023-12-15 1:25 ` Taylor Blau
2023-12-21 10:51 ` Jeff King
2024-01-04 22:24 ` Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cover.1701198172.git.me@ttaylorr.com \
--to=me@ttaylorr.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=peff@peff.net \
--cc=ps@pks.im \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).