From: Patrick Steinhardt <ps@pks.im>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, Jeff King <peff@peff.net>,
Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 07/24] midx: implement `--retain-disjoint` mode
Date: Thu, 30 Nov 2023 11:18:51 +0100 [thread overview]
Message-ID: <ZWhhi15VpeCRflDB@tanuki> (raw)
In-Reply-To: <3019738b52ba8cd78ea696a3b800fa91e722eb66.1701198172.git.me@ttaylorr.com>
[-- Attachment #1: Type: text/plain, Size: 11318 bytes --]
On Tue, Nov 28, 2023 at 02:08:13PM -0500, Taylor Blau wrote:
> Once multi-pack reachability bitmaps learn how to perform pack reuse
> over the set of disjoint packs, we will want to teach `git repack` to
> evolve the set of disjoint packs over time.
>
> To evolve the set of disjoint packs means any new packs made by `repack`
> should be disjoint with respect to the existing set of disjoint packs so
> as to be able to join that set when updating the multi-pack index.
>
> The details of generating such packs will be left to future commits. But
> any new pack(s) created by repack as disjoint will be marked as such by
> passing them over `--stdin-packs` with the special '+' marker when
> generating a new MIDX.
>
> This patch, however, addresses the question of how we retain the
> existing set of disjoint packs when updating the multi-pack index. One
> option would be for `repack` to keep track of the set of disjoint packs
> itself by querying the MIDX, and then adding the special '+' marker
> appropriately when generating the input for `--stdin-packs`.
>
> But this is verbose and error-prone, since two different parts of Git
> would need to maintain the same notion of the set of disjoint packs.
> When one disagrees with the other, the set of so-called disjoint packs
> may actually contain two or more packs which have one or more object(s)
> in common, making the set non-disjoint.
>
> Instead, introduce a `--retain-disjoint` mode for the `git
> multi-pack-index write` sub-command which keeps any packs which are:
>
> - marked as disjoint in the existing MIDX, and
>
> - not deleted (e.g., they are not excluded from the input for
> `--stdin-packs`).
>
> This will allow the `repack` command to not have to keep track of the
> set of currently-disjoint packs itself, reducing the number of lines of
> code necessary to implement this feature, and making the resulting
> implementation less error-prone.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> Documentation/git-multi-pack-index.txt | 8 +++
> builtin/multi-pack-index.c | 3 +
> midx.c | 49 +++++++++++++++
> midx.h | 1 +
> t/lib-disjoint.sh | 38 ++++++++++++
> t/t5319-multi-pack-index.sh | 82 ++++++++++++++++++++++++++
> 6 files changed, 181 insertions(+)
> create mode 100644 t/lib-disjoint.sh
>
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index d130e65b28..ac0c7b124b 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -54,6 +54,14 @@ write::
> "disjoint". See the "`DISP` chunk and disjoint packs"
> section in linkgit:gitformat-pack[5] for more.
>
> + --retain-disjoint::
> + When writing a multi-pack index with a reachability
> + bitmap, keep any packs marked as disjoint in the
> + existing MIDX (if any) as such in the new MIDX. Existing
> + disjoint packs which are removed (e.g., not listed via
> + `--stdin-packs`) are ignored. This option works in
> + addition to the '+' marker for `--stdin-packs`.
I'm trying to understand when you're expected to pass this flag and when
you're expected not to pass it. This documentation could also help in
the documentation here so that the user can make a more informed
decision.
Patrick
> --refs-snapshot=<path>::
> With `--bitmap`, optionally specify a file which
> contains a "refs snapshot" taken prior to repacking.
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 0f1dd4651d..dcfabf2626 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -138,6 +138,9 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
> N_("write multi-pack index containing only given indexes")),
> OPT_FILENAME(0, "refs-snapshot", &opts.refs_snapshot,
> N_("refs snapshot for selecting bitmap commits")),
> + OPT_BIT(0, "retain-disjoint", &opts.flags,
> + N_("retain non-deleted disjoint packs"),
> + MIDX_WRITE_RETAIN_DISJOINT),
> OPT_END(),
> };
>
> diff --git a/midx.c b/midx.c
> index 65ba0c70fe..ce67da9f85 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -721,6 +721,12 @@ static void midx_fanout_add_midx_fanout(struct midx_fanout *fanout,
> &fanout->entries[fanout->nr],
> cur_object);
> fanout->entries[fanout->nr].preferred = 0;
> + /*
> + * It's OK to set disjoint to 0 here, even with
> + * `--retain-disjoint`, since we will always see the disjoint
> + * copy of some object below in get_sorted_entries(), causing us
> + * to die().
> + */
> fanout->entries[fanout->nr].disjoint = 0;
> fanout->nr++;
> }
> @@ -1362,6 +1368,37 @@ static struct multi_pack_index *lookup_multi_pack_index(struct repository *r,
> return result;
> }
>
> +static int midx_retain_existing_disjoint(struct repository *r,
> + struct multi_pack_index *from,
> + struct write_midx_context *ctx)
> +{
> + struct bitmapped_pack bp;
> + uint32_t i, midx_pos;
> +
> + for (i = 0; i < ctx->nr; i++) {
> + struct pack_info *info = &ctx->info[i];
> + /*
> + * Having to call `midx_locate_pack()` in a loop is
> + * sub-optimal, since it is O(n*log(n)) in the number
> + * of packs.
> + *
> + * When reusing an existing MIDX, we know that the first
> + * 'n' packs appear in the same order, so we could avoid
> + * this when reusing an existing MIDX. But we may be
> + * instead relying on the order given to us by
> + * for_each_file_in_pack_dir(), in which case we can't
> + * make any such guarantees.
> + */
> + if (!midx_locate_pack(from, info->pack_name, &midx_pos))
> + continue;
> +
> + if (nth_bitmapped_pack(r, from, &bp, midx_pos) < 0)
> + return -1;
> + info->disjoint = bp.disjoint;
> + }
> + return 0;
> +}
> +
> static int write_midx_internal(const char *object_dir,
> struct string_list *packs_to_include,
> struct string_list *packs_to_drop,
> @@ -1444,6 +1481,18 @@ static int write_midx_internal(const char *object_dir,
> for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
> stop_progress(&ctx.progress);
>
> + if (flags & MIDX_WRITE_RETAIN_DISJOINT) {
> + struct multi_pack_index *m = ctx.m;
> + if (!m)
> + m = lookup_multi_pack_index(the_repository, object_dir);
> +
> + if (m) {
> + result = midx_retain_existing_disjoint(the_repository, m, &ctx);
> + if (result)
> + goto cleanup;
> + }
> + }
> +
> if ((ctx.m && ctx.nr == ctx.m->num_packs) &&
> !(packs_to_include || packs_to_drop)) {
> struct bitmap_index *bitmap_git;
> diff --git a/midx.h b/midx.h
> index a6e969c2ea..d7ce52ff7b 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -54,6 +54,7 @@ struct multi_pack_index {
> #define MIDX_WRITE_BITMAP (1 << 2)
> #define MIDX_WRITE_BITMAP_HASH_CACHE (1 << 3)
> #define MIDX_WRITE_BITMAP_LOOKUP_TABLE (1 << 4)
> +#define MIDX_WRITE_RETAIN_DISJOINT (1 << 5)
>
> const unsigned char *get_midx_checksum(struct multi_pack_index *m);
> void get_midx_filename(struct strbuf *out, const char *object_dir);
> diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
> new file mode 100644
> index 0000000000..c6c6e74aba
> --- /dev/null
> +++ b/t/lib-disjoint.sh
> @@ -0,0 +1,38 @@
> +# Helpers for scripts testing disjoint packs; see t5319 for example usage.
> +
> +objdir=.git/objects
> +
> +test_disjoint_1 () {
> + local pack="$1"
> + local want="$2"
> +
> + test-tool read-midx --bitmap $objdir >out &&
> + grep -A 3 "$pack" out >found &&
> +
> + if ! test -s found
> + then
> + echo >&2 "could not find '$pack' in MIDX"
> + return 1
> + fi
> +
> + if ! grep -q "disjoint: $want" found
> + then
> + echo >&2 "incorrect disjoint state for pack '$pack'"
> + return 1
> + fi
> + return 0
> +}
> +
> +# test_must_be_disjoint <pack-$XYZ.pack>
> +#
> +# Ensures that the given pack is marked as disjoint.
> +test_must_be_disjoint () {
> + test_disjoint_1 "$1" "yes"
> +}
> +
> +# test_must_not_be_disjoint <pack-$XYZ.pack>
> +#
> +# Ensures that the given pack is not marked as disjoint.
> +test_must_not_be_disjoint () {
> + test_disjoint_1 "$1" "no"
> +}
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index fd24e0c952..02cfddf151 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -3,6 +3,7 @@
> test_description='multi-pack-indexes'
> . ./test-lib.sh
> . "$TEST_DIRECTORY"/lib-chunk.sh
> +. "$TEST_DIRECTORY"/lib-disjoint.sh
>
> GIT_TEST_MULTI_PACK_INDEX=0
> objdir=.git/objects
> @@ -1215,4 +1216,85 @@ test_expect_success 'non-disjoint packs are detected' '
> )
> '
>
> +test_expect_success 'retain disjoint packs while writing' '
> + test_when_finished "rm -fr repo" &&
> + git init repo &&
> + (
> + cd repo &&
> +
> + for i in 1 2
> + do
> + test_commit "$i" && git repack -d || return 1
> + done &&
> +
> + find $objdir/pack -type f -name "pack-*.idx" |
> + sed -e "s/^.*\/\(.*\)/\1/g" | sort >packs.old &&
> +
> + test_line_count = 2 packs.old &&
> + disjoint="$(head -n 1 packs.old)" &&
> + non_disjoint="$(tail -n 1 packs.old)" &&
> +
> + cat >in <<-EOF &&
> + +$disjoint
> + $non_disjoint
> + EOF
> + git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> + test_must_be_disjoint "${disjoint%.idx}.pack" &&
> + test_must_not_be_disjoint "${non_disjoint%.idx}.pack" &&
> +
> + test_commit 3 &&
> + git repack -d &&
> +
> + find $objdir/pack -type f -name "pack-*.idx" |
> + sed -e "s/^.*\/\(.*\)/\1/g" | sort >packs.new &&
> +
> + new_disjoint="$(comm -13 packs.old packs.new)" &&
> + cat >in <<-EOF &&
> + $disjoint
> + $non_disjoint
> + +$new_disjoint
> + EOF
> + git multi-pack-index write --stdin-packs --bitmap \
> + --retain-disjoint <in &&
> +
> + test_must_be_disjoint "${disjoint%.idx}.pack" &&
> + test_must_be_disjoint "${new_disjoint%.idx}.pack" &&
> + test_must_not_be_disjoint "${non_disjoint%.idx}.pack"
> +
> + )
> +'
> +
> +test_expect_success 'non-disjoint packs are detected via --retain-disjoint' '
> + test_when_finished "rm -fr repo" &&
> + git init repo &&
> + (
> + cd repo &&
> + packdir=.git/objects/pack &&
> +
> + test_commit base &&
> + base="$(echo base | git pack-objects --revs $packdir/pack)" &&
> +
> + cat >in <<-EOF &&
> + +pack-$base.idx
> + EOF
> + git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> + test_must_be_disjoint "pack-$base.pack" &&
> +
> + test_commit other &&
> + other="$(echo other | git pack-objects --revs $packdir/pack)" &&
> +
> + cat >in <<-EOF &&
> + pack-$base.idx
> + +pack-$other.idx
> + EOF
> + test_must_fail git multi-pack-index write --stdin-packs --retain-disjoint --bitmap <in 2>err &&
> + grep "duplicate object.* among disjoint packs" err &&
> +
> + test_must_fail git multi-pack-index write --retain-disjoint --bitmap 2>err &&
> + grep "duplicate object.* among disjoint packs" err
> + )
> +'
> +
> test_done
> --
> 2.43.0.24.g980b318f98
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2023-11-30 10:18 UTC|newest]
Thread overview: 107+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
2023-11-28 19:07 ` [PATCH 01/24] pack-objects: free packing_data in more places Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:08 ` Taylor Blau
2023-11-28 19:07 ` [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:11 ` Taylor Blau
2023-12-12 7:04 ` Jeff King
2023-12-12 16:48 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 03/24] pack-bitmap: plug leak in find_objects() Taylor Blau
2023-12-12 7:04 ` Jeff King
2023-11-28 19:08 ` [PATCH 04/24] midx: factor out `fill_pack_info()` Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:19 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 05/24] midx: implement `DISP` chunk Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:27 ` Taylor Blau
2023-12-03 13:15 ` Junio C Hamano
2023-12-05 19:26 ` Taylor Blau
2023-12-09 1:40 ` Junio C Hamano
2023-12-09 2:30 ` Taylor Blau
2023-12-12 8:03 ` Jeff King
2023-12-13 18:28 ` Taylor Blau
2023-12-13 19:20 ` Junio C Hamano
2023-11-28 19:08 ` [PATCH 06/24] midx: implement `midx_locate_pack()` Taylor Blau
2023-11-28 19:08 ` [PATCH 07/24] midx: implement `--retain-disjoint` mode Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt [this message]
2023-11-30 19:29 ` Taylor Blau
2023-12-01 8:02 ` Patrick Steinhardt
2023-11-28 19:08 ` [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode Taylor Blau
2023-11-30 10:18 ` Patrick Steinhardt
2023-11-30 19:32 ` Taylor Blau
2023-12-01 8:17 ` Patrick Steinhardt
2023-12-01 19:58 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 09/24] repack: implement `--extend-disjoint` mode Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:28 ` Taylor Blau
2023-12-08 8:19 ` Patrick Steinhardt
2023-12-08 22:48 ` Taylor Blau
2023-12-11 8:18 ` Patrick Steinhardt
2023-12-11 19:59 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:34 ` Taylor Blau
2023-12-08 8:19 ` Patrick Steinhardt
2023-11-28 19:08 ` [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 14:36 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 12/24] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
2023-11-28 19:08 ` [PATCH 13/24] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
2023-11-28 19:08 ` [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:43 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 15/24] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
2023-11-28 19:08 ` [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
2023-12-07 13:13 ` Patrick Steinhardt
2023-12-07 20:47 ` Taylor Blau
2023-11-28 19:08 ` [PATCH 17/24] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
2023-11-28 19:08 ` [PATCH 18/24] pack-objects: include number of packs reused in output Taylor Blau
2023-11-28 19:08 ` [PATCH 19/24] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
2023-11-28 19:08 ` [PATCH 20/24] pack-objects: add tracing for various packfile metrics Taylor Blau
2023-11-28 19:08 ` [PATCH 21/24] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
2023-11-28 19:08 ` [PATCH 22/24] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
2023-11-28 19:08 ` [PATCH 23/24] pack-bitmap: reuse objects from all disjoint packs Taylor Blau
2023-11-28 19:08 ` [PATCH 24/24] t/perf: add performance tests for multi-pack reuse Taylor Blau
2023-11-30 10:18 ` [PATCH 00/24] pack-objects: multi-pack verbatim reuse Patrick Steinhardt
2023-11-30 19:39 ` Taylor Blau
2023-12-01 8:31 ` Patrick Steinhardt
2023-12-01 20:02 ` Taylor Blau
2023-12-04 8:49 ` Patrick Steinhardt
2023-12-12 8:12 ` Jeff King
2023-12-15 15:37 ` Taylor Blau
2023-12-21 11:13 ` Jeff King
2024-01-04 22:22 ` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
2023-12-14 22:23 ` [PATCH v2 01/26] pack-objects: free packing_data in more places Taylor Blau
2023-12-14 22:23 ` [PATCH v2 02/26] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
2023-12-14 22:23 ` [PATCH v2 03/26] pack-bitmap: plug leak in find_objects() Taylor Blau
2023-12-14 22:23 ` [PATCH v2 04/26] midx: factor out `fill_pack_info()` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 05/26] midx: implement `BTMP` chunk Taylor Blau
2023-12-14 22:23 ` [PATCH v2 06/26] midx: implement `midx_locate_pack()` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 07/26] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
2023-12-14 22:23 ` [PATCH v2 08/26] ewah: implement `bitmap_is_empty()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 09/26] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
2023-12-14 22:24 ` [PATCH v2 10/26] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 11/26] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
2023-12-14 22:24 ` [PATCH v2 12/26] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
2023-12-14 22:24 ` [PATCH v2 13/26] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
2023-12-14 22:24 ` [PATCH v2 14/26] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
2023-12-14 22:24 ` [PATCH v2 15/26] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
2023-12-14 22:24 ` [PATCH v2 16/26] pack-objects: include number of packs reused in output Taylor Blau
2023-12-14 22:24 ` [PATCH v2 17/26] git-compat-util.h: implement checked size_t to uint32_t conversion Taylor Blau
2023-12-14 22:24 ` [PATCH v2 18/26] midx: implement `midx_preferred_pack()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 19/26] pack-revindex: factor out `midx_key_to_pack_pos()` helper Taylor Blau
2023-12-14 22:24 ` [PATCH v2 20/26] pack-revindex: implement `midx_pair_to_pack_pos()` Taylor Blau
2023-12-14 22:24 ` [PATCH v2 21/26] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
2023-12-14 22:24 ` [PATCH v2 22/26] pack-objects: add tracing for various packfile metrics Taylor Blau
2023-12-14 22:24 ` [PATCH v2 23/26] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
2023-12-14 22:24 ` [PATCH v2 24/26] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
2023-12-14 22:24 ` [PATCH v2 25/26] pack-bitmap: enable reuse from all bitmapped packs Taylor Blau
2023-12-14 22:24 ` [PATCH v2 26/26] t/perf: add performance tests for multi-pack reuse Taylor Blau
2023-12-15 0:06 ` [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse Junio C Hamano
2023-12-15 0:38 ` Taylor Blau
2023-12-15 0:40 ` Junio C Hamano
2023-12-15 1:25 ` Taylor Blau
2023-12-21 10:51 ` Jeff King
2024-01-04 22:24 ` Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZWhhi15VpeCRflDB@tanuki \
--to=ps@pks.im \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).