git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Patrick Steinhardt <ps@pks.im>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, Jeff King <peff@peff.net>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse
Date: Thu, 30 Nov 2023 11:18:19 +0100	[thread overview]
Message-ID: <ZWhha2h2zzZWCXrw@tanuki> (raw)
In-Reply-To: <cover.1701198172.git.me@ttaylorr.com>

[-- Attachment #1: Type: text/plain, Size: 10142 bytes --]

On Tue, Nov 28, 2023 at 02:07:54PM -0500, Taylor Blau wrote:
> Back in fff42755ef (pack-bitmap: add support for bitmap indexes,
> 2013-12-21), we added support for reachability bitmaps, and taught
> pack-objects how to reuse verbatim chunks from the bitmapped pack. When
> multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to
> use the MIDX's "preferred" pack as the source for verbatim reuse.
> 
> This allows repositories to incrementally repack themselves (e.g., using
> a `--geometric` repack), storing the result in a MIDX, and generating a
> corresponding bitmap. This keeps our bitmap coverage up-to-date, while
> maintaining a relatively small number of packs.
> 
> However, it is recommended (and matches what we do in production at
> GitHub) that repositories repack themselves all-into-one, and
> generate a corresponding single-pack reachability bitmap. This is done
> for a couple of reasons, but the most relevant one to this series is
> that it enables us to perform verbatim pack-reuse over a complete copy
> of the repository, since the entire repository resides in a single pack
> (and thus is eligible for verbatim pack-reuse).
> 
> As repositories grow larger, packing their contents into a single pack
> becomes less feasible. This series extends the pack-reuse mechanism to
> operate over multiple packs which are known ahead of time to be disjoint
> with respect to one another's set of objects.
> 
> The implementation has a few components:
> 
>   - A new MIDX chunk called "Disjoint packfiles" or DISP is introduced
>     to keep track of the bitmap position, number of objects, and
>     disjointed-ness for each pack contained in the MIDX.
> 
>   - A new mode for `git multi-pack-index write --stdin-packs` that
>     allows specifying disjoint packs, as well as a new option
>     `--retain-disjoint` which preserves the set of existing disjoint
>     packs in the new MIDX.
> 
>   - A new pack-objects mode `--ignore-disjoint`, which produces packs
>     which are disjoint with respect to the current set of disjoint packs
>     (i.e. it discards any objects from the packing list which appear in
>     any of the known-disjoint packs).
> 
>   - A new repack mode, `--extend-disjoint` which causes any new pack(s)
>     which are generated to be disjoint with respect to the set of packs
>     currently marked as disjoint, minus any pack(s) which are about to
>     be deleted.
> 
> With all of that in place, the patch series then rewrites all of the
> pack-reuse functions in terms of the new `bitmapped_pack` structure.
> Once we have dropped all of the assumptions stemming from only
> performing pack-reuse over a single candidate pack, we can then enable
> reuse over all of the disjoint packs.
> 
> In addition to the many new tests in t5332 added by that series, I tried
> to simulate a "real world" test on git.git by breaking the repository
> into chunks of 1,000 commits (plus their set of reachable objects not
> reachable from earlier chunk(s)) and packing those chunks. This produces
> a large number of packs with the objects from git.git which are known to
> be disjoint with respect to one another.
> 
>     $ git clone git@github.com:git/git.git base
> 
>     $ cd base
>     $ mv .git/objects/pack/pack-*.idx{,.bak}
>     $ git unpack-objects <.git/objects/pack/pack-*.pack
> 
>     # pack the objects from each successive block of 1k commits
>     $ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac)
>       do
>         echo $rev |
>         git.compile pack-objects --revs --unpacked .git/objects/pack/pack || return 1
>       done
>     # and grab any stragglers, pruning the unpacked objects
>     $ git repack -d
>     I then constructed a MIDX and corresponding bitmap
> 
>     $ find_pack () {
>         for idx in .git/objects/pack/pack-*.idx
>         do
>           git show-index <$idx | grep -q "$1" && basename $idx
>         done
>       }
>     $ preferred="$(find_pack $(git rev-parse HEAD))"
> 
>     $ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' |
>         git.compile multi-pack-index write --bitmap --stdin-packs \
>           --preferred-pack=$preferred
>     $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
> 
> With all of that in place, I was able to produce a significant speed-up
> by reusing objects from multiple packs:
> 
>     $ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
>     Benchmark 1: single-pack reuse
>       Time (mean ± σ):      6.094 s ±  0.023 s    [User: 43.723 s, System: 0.358 s]
>       Range (min … max):    6.063 s …  6.126 s    10 runs
> 
>     Benchmark 2: multi-pack reuse
>       Time (mean ± σ):     906.5 ms ±   3.2 ms    [User: 1081.5 ms, System: 30.9 ms]
>       Range (min … max):   903.5 ms … 912.7 ms    10 runs
> 
>     Summary
>       multi-pack reuse ran
>         6.72 ± 0.03 times faster than single-pack reuse
> 
> (There are corresponding tests in p5332 that test different sized chunks
> and measure the runtime performance as well as resulting pack size).
> 
> Performing verbatim pack reuse naturally trades off between CPU time and
> the resulting pack size. In the above example, the single-pack reuse
> case produces a clone size of ~194 MB on my machine, while the
> multi-pack reuse case produces a clone size closer to ~266 MB, which is
> a ~37% increase in clone size.

Quite exciting, and a tradeoff that may be worth it for Git hosters. I
expect that this is going to be an extreme example of the benefits
provided by your patch series -- do you by any chance also have "real"
numbers that make it possible to quantify the effect a bit better?

No worry if you don't, I'm just curious.

> I think there is still some opportunity to close this gap, since the
> "packing" strategy here is extremely naive. In a production setting, I'm
> sure that there are more well thought out repacking strategies that
> would produce more similar clone sizes.
> 
> I considered breaking this series up into smaller chunks, but was
> unsatisfied with the result. Since this series is rather large, if you
> have alternate suggestions on better ways to structure this, please let
> me know.

The series is indeed very involved to review. I only made it up to patch
8/24 and already spent quite some time on it. So I'd certainly welcome
it if this was split up into smaller parts, but don't have a suggestion
as to how this should be done (also because I didn't yet read the other
16 patches).

I'll review the remaining patches at a later point in time.

Patrick

> Thanks in advance for your review!
> 
> Taylor Blau (24):
>   pack-objects: free packing_data in more places
>   pack-bitmap-write: deep-clear the `bb_commit` slab
>   pack-bitmap: plug leak in find_objects()
>   midx: factor out `fill_pack_info()`
>   midx: implement `DISP` chunk
>   midx: implement `midx_locate_pack()`
>   midx: implement `--retain-disjoint` mode
>   pack-objects: implement `--ignore-disjoint` mode
>   repack: implement `--extend-disjoint` mode
>   pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
>   pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
>   pack-bitmap: return multiple packs via
>     `reuse_partial_packfile_from_bitmap()`
>   pack-objects: parameterize pack-reuse routines over a single pack
>   pack-objects: keep track of `pack_start` for each reuse pack
>   pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
>   pack-objects: prepare `write_reused_pack()` for multi-pack reuse
>   pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
>     reuse
>   pack-objects: include number of packs reused in output
>   pack-bitmap: prepare to mark objects from multiple packs for reuse
>   pack-objects: add tracing for various packfile metrics
>   t/test-lib-functions.sh: implement `test_trace2_data` helper
>   pack-objects: allow setting `pack.allowPackReuse` to "single"
>   pack-bitmap: reuse objects from all disjoint packs
>   t/perf: add performance tests for multi-pack reuse
> 
>  Documentation/config/pack.txt          |   8 +-
>  Documentation/git-multi-pack-index.txt |  12 ++
>  Documentation/git-pack-objects.txt     |   8 +
>  Documentation/git-repack.txt           |  12 ++
>  Documentation/gitformat-pack.txt       | 109 ++++++++++
>  builtin/multi-pack-index.c             |  13 +-
>  builtin/pack-objects.c                 | 200 +++++++++++++++----
>  builtin/repack.c                       |  57 +++++-
>  midx.c                                 | 218 +++++++++++++++++---
>  midx.h                                 |  11 +-
>  pack-bitmap-write.c                    |   9 +-
>  pack-bitmap.c                          | 265 ++++++++++++++++++++-----
>  pack-bitmap.h                          |  18 +-
>  pack-objects.c                         |  15 ++
>  pack-objects.h                         |   1 +
>  t/helper/test-read-midx.c              |  31 ++-
>  t/lib-disjoint.sh                      |  49 +++++
>  t/perf/p5332-multi-pack-reuse.sh       |  81 ++++++++
>  t/t5319-multi-pack-index.sh            | 140 +++++++++++++
>  t/t5331-pack-objects-stdin.sh          | 156 +++++++++++++++
>  t/t5332-multi-pack-reuse.sh            | 219 ++++++++++++++++++++
>  t/t6113-rev-list-bitmap-filters.sh     |   2 +
>  t/t7700-repack.sh                      |   4 +-
>  t/t7705-repack-extend-disjoint.sh      | 142 +++++++++++++
>  t/test-lib-functions.sh                |  14 ++
>  25 files changed, 1650 insertions(+), 144 deletions(-)
>  create mode 100644 t/lib-disjoint.sh
>  create mode 100755 t/perf/p5332-multi-pack-reuse.sh
>  create mode 100755 t/t5332-multi-pack-reuse.sh
>  create mode 100755 t/t7705-repack-extend-disjoint.sh
> 
> 
> base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d
> -- 
> 2.43.0.24.g980b318f98

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2023-11-30 10:18 UTC|newest]

Thread overview: 107+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
2023-11-28 19:07 ` [PATCH 01/24] pack-objects: free packing_data in more places Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:08     ` Taylor Blau
2023-11-28 19:07 ` [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:11     ` Taylor Blau
2023-12-12  7:04   ` Jeff King
2023-12-12 16:48     ` Taylor Blau
2023-11-28 19:08 ` [PATCH 03/24] pack-bitmap: plug leak in find_objects() Taylor Blau
2023-12-12  7:04   ` Jeff King
2023-11-28 19:08 ` [PATCH 04/24] midx: factor out `fill_pack_info()` Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:19     ` Taylor Blau
2023-11-28 19:08 ` [PATCH 05/24] midx: implement `DISP` chunk Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:27     ` Taylor Blau
2023-12-03 13:15   ` Junio C Hamano
2023-12-05 19:26     ` Taylor Blau
2023-12-09  1:40       ` Junio C Hamano
2023-12-09  2:30         ` Taylor Blau
2023-12-12  8:03           ` Jeff King
2023-12-13 18:28             ` Taylor Blau
2023-12-13 19:20               ` Junio C Hamano
2023-11-28 19:08 ` [PATCH 06/24] midx: implement `midx_locate_pack()` Taylor Blau
2023-11-28 19:08 ` [PATCH 07/24] midx: implement `--retain-disjoint` mode Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:29     ` Taylor Blau
2023-12-01  8:02       ` Patrick Steinhardt
2023-11-28 19:08 ` [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:32     ` Taylor Blau
2023-12-01  8:17       ` Patrick Steinhardt
2023-12-01 19:58         ` Taylor Blau
2023-11-28 19:08 ` [PATCH 09/24] repack: implement `--extend-disjoint` mode Taylor Blau
2023-12-07 13:13   ` Patrick Steinhardt
2023-12-07 20:28     ` Taylor Blau
2023-12-08  8:19       ` Patrick Steinhardt
2023-12-08 22:48         ` Taylor Blau
2023-12-11  8:18           ` Patrick Steinhardt
2023-12-11 19:59             ` Taylor Blau
2023-11-28 19:08 ` [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
2023-12-07 13:13   ` Patrick Steinhardt
2023-12-07 20:34     ` Taylor Blau
2023-12-08  8:19       ` Patrick Steinhardt
2023-11-28 19:08 ` [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
2023-12-07 13:13   ` Patrick Steinhardt
2023-12-07 14:36     ` Taylor Blau
2023-11-28 19:08 ` [PATCH 12/24] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
2023-11-28 19:08 ` [PATCH 13/24] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
2023-11-28 19:08 ` [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
2023-12-07 13:13   ` Patrick Steinhardt
2023-12-07 20:43     ` Taylor Blau
2023-11-28 19:08 ` [PATCH 15/24] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
2023-11-28 19:08 ` [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
2023-12-07 13:13   ` Patrick Steinhardt
2023-12-07 20:47     ` Taylor Blau
2023-11-28 19:08 ` [PATCH 17/24] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
2023-11-28 19:08 ` [PATCH 18/24] pack-objects: include number of packs reused in output Taylor Blau
2023-11-28 19:08 ` [PATCH 19/24] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
2023-11-28 19:08 ` [PATCH 20/24] pack-objects: add tracing for various packfile metrics Taylor Blau
2023-11-28 19:08 ` [PATCH 21/24] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
2023-11-28 19:08 ` [PATCH 22/24] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
2023-11-28 19:08 ` [PATCH 23/24] pack-bitmap: reuse objects from all disjoint packs Taylor Blau
2023-11-28 19:08 ` [PATCH 24/24] t/perf: add performance tests for multi-pack reuse Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt [this message]
2023-11-30 19:39   ` [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
2023-12-01  8:31     ` Patrick Steinhardt
2023-12-01 20:02       ` Taylor Blau
2023-12-04  8:49         ` Patrick Steinhardt
2023-12-12  8:12 ` Jeff King
2023-12-15 15:37   ` Taylor Blau
2023-12-21 11:13     ` Jeff King
2024-01-04 22:22       ` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
2023-12-14 22:23   ` [PATCH v2 01/26] pack-objects: free packing_data in more places Taylor Blau
2023-12-14 22:23   ` [PATCH v2 02/26] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
2023-12-14 22:23   ` [PATCH v2 03/26] pack-bitmap: plug leak in find_objects() Taylor Blau
2023-12-14 22:23   ` [PATCH v2 04/26] midx: factor out `fill_pack_info()` Taylor Blau
2023-12-14 22:23   ` [PATCH v2 05/26] midx: implement `BTMP` chunk Taylor Blau
2023-12-14 22:23   ` [PATCH v2 06/26] midx: implement `midx_locate_pack()` Taylor Blau
2023-12-14 22:23   ` [PATCH v2 07/26] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
2023-12-14 22:23   ` [PATCH v2 08/26] ewah: implement `bitmap_is_empty()` Taylor Blau
2023-12-14 22:24   ` [PATCH v2 09/26] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
2023-12-14 22:24   ` [PATCH v2 10/26] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
2023-12-14 22:24   ` [PATCH v2 11/26] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
2023-12-14 22:24   ` [PATCH v2 12/26] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
2023-12-14 22:24   ` [PATCH v2 13/26] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
2023-12-14 22:24   ` [PATCH v2 14/26] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
2023-12-14 22:24   ` [PATCH v2 15/26] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
2023-12-14 22:24   ` [PATCH v2 16/26] pack-objects: include number of packs reused in output Taylor Blau
2023-12-14 22:24   ` [PATCH v2 17/26] git-compat-util.h: implement checked size_t to uint32_t conversion Taylor Blau
2023-12-14 22:24   ` [PATCH v2 18/26] midx: implement `midx_preferred_pack()` Taylor Blau
2023-12-14 22:24   ` [PATCH v2 19/26] pack-revindex: factor out `midx_key_to_pack_pos()` helper Taylor Blau
2023-12-14 22:24   ` [PATCH v2 20/26] pack-revindex: implement `midx_pair_to_pack_pos()` Taylor Blau
2023-12-14 22:24   ` [PATCH v2 21/26] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
2023-12-14 22:24   ` [PATCH v2 22/26] pack-objects: add tracing for various packfile metrics Taylor Blau
2023-12-14 22:24   ` [PATCH v2 23/26] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
2023-12-14 22:24   ` [PATCH v2 24/26] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
2023-12-14 22:24   ` [PATCH v2 25/26] pack-bitmap: enable reuse from all bitmapped packs Taylor Blau
2023-12-14 22:24   ` [PATCH v2 26/26] t/perf: add performance tests for multi-pack reuse Taylor Blau
2023-12-15  0:06   ` [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse Junio C Hamano
2023-12-15  0:38     ` Taylor Blau
2023-12-15  0:40     ` Junio C Hamano
2023-12-15  1:25       ` Taylor Blau
2023-12-21 10:51         ` Jeff King
2024-01-04 22:24           ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZWhha2h2zzZWCXrw@tanuki \
    --to=ps@pks.im \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).