git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Patrick Steinhardt <ps@pks.im>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, Elijah Newren <newren@gmail.com>,
	Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 16/17] midx: implement MIDX compaction
Date: Tue, 9 Dec 2025 08:21:32 +0100	[thread overview]
Message-ID: <aTfN_PycU9ag8c0u@pks.im> (raw)
In-Reply-To: <c136b2e179d02321de7e7b3f1b6c748cb434d68d.1765053054.git.me@ttaylorr.com>

On Sat, Dec 06, 2025 at 03:31:47PM -0500, Taylor Blau wrote:
> diff --git a/Documentation/git-multi-pack-index.adoc b/Documentation/git-multi-pack-index.adoc
> index 164cf1f2291..a9664e77411 100644
> --- a/Documentation/git-multi-pack-index.adoc
> +++ b/Documentation/git-multi-pack-index.adoc
> @@ -12,6 +12,8 @@ SYNOPSIS
>  'git multi-pack-index' [<options>] write [--preferred-pack=<pack>]
>  		         [--[no-]bitmap] [--[no-]incremental] [--[no-]stdin-packs]
>  		         [--refs-snapshot=<path>]
> +'git multi-pack-index' [<options>] compact [--[no-]incremental]
> +		         <from> <to>
>  'git multi-pack-index' [<options>] verify
>  'git multi-pack-index' [<options>] expire
>  'git multi-pack-index' [<options>] repack [--batch-size=<size>]
> @@ -83,6 +85,17 @@ marker).
>  		necessary.
>  --
>  
> +compact::
> +	Write a new MIDX layer containing only objects and packs present
> +	in the range `<from>` to `<to>`, where both arguments are
> +	checksums of existing layers in the MIDX chain.
> ++
> +--
> +	--incremental::
> +		Write the result to a MIDX chain instead of writing a
> +		stand-alone MIDX. Incompatible with `--bitmap`.

Interesting. What would happen if you compact a subrange of the MIDX
chain without incremental? Would the MIDX be completely replaced with a
MIDX that only covers these packs?

Also, the "--bitmap" flag does not exist yet, so the second sentence
probably needs to be introduced in the next commit.

> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index c0c6c1760c0..9b0c2082cb3 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -195,6 +204,63 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
>  	return ret;
>  }
>  
> +static int cmd_multi_pack_index_compact(int argc, const char **argv,
> +					const char *prefix,
> +					struct repository *repo)
> +{
> +	struct multi_pack_index *m, *cur;
> +	struct multi_pack_index *from_midx = NULL;
> +	struct multi_pack_index *to_midx = NULL;
> +	struct odb_source *source;
> +	int ret;
> +
> +	struct option *options;
> +	static struct option builtin_multi_pack_index_compact_options[] = {
> +		OPT_BIT(0, "incremental", &opts.flags,
> +			N_("write a new incremental MIDX"), MIDX_WRITE_INCREMENTAL),
> +		OPT_END(),
> +	};
> +
> +	repo_config(repo, git_multi_pack_index_write_config, NULL);
> +
> +	options = add_common_options(builtin_multi_pack_index_compact_options);
> +
> +	trace2_cmd_mode(argv[0]);
> +
> +	if (isatty(2))
> +		opts.flags |= MIDX_PROGRESS;
> +	argc = parse_options(argc, argv, prefix,
> +			     options, builtin_multi_pack_index_compact_usage,
> +			     0);
> +
> +	if (argc != 2)
> +		usage_with_options(builtin_multi_pack_index_compact_usage,
> +				   options);
> +	source = handle_object_dir_option(the_repository);
> +
> +	FREE_AND_NULL(options);
> +
> +	m = get_multi_pack_index(source);
> +
> +	for (cur = m; cur && !(from_midx && to_midx); cur = cur->base_midx) {
> +		const char *midx_csum = get_midx_checksum(cur);
> +
> +		if (!from_midx && !strcmp(midx_csum, argv[0]))
> +			from_midx = cur;
> +		if (!to_midx && !strcmp(midx_csum, argv[1]))
> +			to_midx = cur;
> +	}
> +
> +	if (!from_midx)
> +		die(_("could not find MIDX 'from': %s"), argv[0]);
> +	if (!to_midx)
> +		die(_("could not find MIDX 'to': %s"), argv[1]);
> +
> +	ret = write_midx_file_compact(source, from_midx, to_midx, opts.flags);
> +
> +	return ret;
> +}

Is it valid if `from_midx == to_midx`?

> diff --git a/midx-write.c b/midx-write.c
> index 7854561359d..fcbfedcd913 100644
> --- a/midx-write.c
> +++ b/midx-write.c
> @@ -953,6 +980,72 @@ static int fill_packs_from_midx(struct write_midx_context *ctx)
>  	return 0;
>  }
>  
> +static uint32_t compactible_packs_between(const struct multi_pack_index *from,
> +					  const struct multi_pack_index *to)
> +{
> +	uint32_t nr;
> +
> +	ASSERT(from && to);
> +
> +	nr = u32_add(to->num_packs, to->num_packs_in_base);
> +	if (nr < from->num_packs_in_base)
> +		BUG("unexpected number of packs in base during compaction: "
> +		    "%"PRIu32" < %"PRIu32, nr, from->num_packs_in_base);
> +
> +	return nr - from->num_packs_in_base;
> +}
> +
> +static int fill_packs_from_midx_range(struct write_midx_context *ctx,
> +				      int bitmap_order)
> +{
> +	struct multi_pack_index *m = ctx->compact_to;
> +	uint32_t packs_nr;
> +
> +	ASSERT(ctx->compact && !ctx->nr);
> +	ASSERT(ctx->compact_from);
> +	ASSERT(ctx->compact_to);
> +
> +	packs_nr = compactible_packs_between(ctx->compact_from,
> +					     ctx->compact_to);
> +
> +	ALLOC_GROW(ctx->info, packs_nr, ctx->alloc);
> +
> +	while (m != ctx->compact_from->base_midx) {
> +		uint32_t pack_int_id, preferred_pack_id;
> +		uint32_t i;
> +
> +		if (bitmap_order) {
> +			if (midx_preferred_pack(m, &preferred_pack_id) < 0)
> +				die(_("could not determine preferred pack"));

`midx_preferred_pack()` only returns a valid pack ID in case we've got a
reverse index, and as far as I understand we seem to only generate those
when computing bitmaps. I assume that this means that we can only
compact MIDX layers in bitmap order if they already were in bitmap order
before?

That would at least also make sense. We of course cannot randomly change
the order in the middle of our layers, as that would break later layers
that build on top.

> +		} else {
> +			preferred_pack_id = m->num_packs_in_base;
> +		}
> +
> +		pack_int_id = m->num_packs_in_base - ctx->compact_from->num_packs_in_base;
> +
> +		if (fill_pack_from_midx(&ctx->info[pack_int_id++], m,
> +					preferred_pack_id) < 0)
> +			return -1;
> +
> +		for (i = m->num_packs_in_base;
> +		     i < m->num_packs_in_base + m->num_packs; i++) {
> +			if (preferred_pack_id == i)
> +				continue;
> +
> +			if (fill_pack_from_midx(&ctx->info[pack_int_id++], m,
> +						i) < 0)
> +				return -1;
> +		}
> +

So the condition that should hold after this loop is `pack_int_id ==
m->num_packs`. Which is somewhat obvious: we skip one pack, but that
pack is the preferred pack that we have populated first.

> @@ -1101,11 +1216,18 @@ static int write_midx_internal(struct write_midx_opts *opts)
>  			 */
>  			if (ctx.incremental)
>  				ctx.base_midx = m;
> -			else if (!opts->packs_to_include)
> +			if (!opts->packs_to_include)
>  				ctx.m = m;

I'm a bit surprised by this change here. I would've expected that we
never pass `packs_to_include` when compacting, so why is this change
necessary?

>  		}
>  	}
>  
> +	/*
> +	 * If compacting MIDX layer(s) in the range [from, to], then the
> +	 * compacted MIDX will share the same base MIDX as 'from'.
> +	 */
> +	if (ctx.compact)
> +		ctx.base_midx = ctx.compact_from->base_midx;

Okay, here we overwrite `ctx.base_midx` that we might've already set in
the condition above. It's a bit confusing, doubly so because we may be
warning about the invalid MIDX and claim to ignore it, but ultimately we
don't.

> diff --git a/t/t5335-compact-multi-pack-index.sh b/t/t5335-compact-multi-pack-index.sh
> new file mode 100755
> index 00000000000..f889af7fb1d
> --- /dev/null
> +++ b/t/t5335-compact-multi-pack-index.sh
> @@ -0,0 +1,102 @@
> +#!/bin/sh
> +
> +test_description='multi-pack-index compaction'
> +
> +. ./test-lib.sh
> +
> +GIT_TEST_MULTI_PACK_INDEX=0
> +GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
> +GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
> +
> +objdir=.git/objects
> +packdir=$objdir/pack
> +midxdir=$packdir/multi-pack-index.d
> +midx_chain=$midxdir/multi-pack-index-chain
> +
> +nth_line() {
> +	local n="$1"
> +	shift
> +	awk "NR==$n" "$@"
> +}
> +
> +write_packs () {
> +	for c in "$@"
> +	do
> +		test_commit "$c" &&

Nit: it might be sensible to disable housekeeping here. You strongly
depend on the on-disk shape of the objects, so if you by chance wrote
two objects starting with "17" we'd end up repacking and racing.

I've also got an upcoming patch series in mindthat I've got cooking to
make geometric compaction the default for auto-maintenance. We've got
many test suites that implicitly rely on the current algorithm used by
git-gc(1), so I'd love to avoid adding more.

[snip]
> +test_expect_success 'MIDX compaction with lex-ordered pack names' '
> +	git init midx-compact-lex-order &&
> +	(
> +		cd midx-compact-lex-order &&
> +
> +		write_packs A B C D E &&
> +		test_line_count = 5 $midx_chain &&
> +
> +		git multi-pack-index compact --incremental \
> +			"$(nth_line 2 "$midx_chain")" \
> +			"$(nth_line 4 "$midx_chain")" &&
> +		test_line_count = 3 $midx_chain &&
> +
> +		test_midx_layer_packs "$(nth_line 1 "$midx_chain")" A &&
> +		test_midx_layer_packs "$(nth_line 2 "$midx_chain")" B C D &&
> +		test_midx_layer_packs "$(nth_line 3 "$midx_chain")" E &&
> +
> +		test_midx_layer_object_uniqueness
> +	)
> +'

It would be nice to also test for requests that don't make sense: "from"
larger than "to", "from == to", missing "from" or "foo" and so on.

Patrick

  reply	other threads:[~2025-12-09  7:21 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-06 20:30 [PATCH 00/17] midx: incremental MIDX/bitmap layer compaction Taylor Blau
2025-12-06 20:31 ` [PATCH 01/17] midx: mark `get_midx_checksum()` arguments as const Taylor Blau
2025-12-08 18:26   ` Patrick Steinhardt
2025-12-09  1:41     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 02/17] midx: split `get_midx_checksum()` by adding `get_midx_hash()` Taylor Blau
2025-12-08 18:25   ` Patrick Steinhardt
2025-12-09  1:42     ` Taylor Blau
2025-12-09  1:50       ` Taylor Blau
2025-12-09  6:27         ` Patrick Steinhardt
2025-12-06 20:31 ` [PATCH 03/17] builtin/multi-pack-index.c: make '--progress' a common option Taylor Blau
2025-12-06 20:31 ` [PATCH 04/17] git-multi-pack-index(1): remove non-existent incompatibility Taylor Blau
2025-12-06 20:31 ` [PATCH 05/17] git-multi-pack-index(1): align SYNOPSIS with 'git multi-pack-index -h' Taylor Blau
2025-12-06 20:31 ` [PATCH 06/17] t/t5319-multi-pack-index.sh: fix copy-and-paste error in t5319.39 Taylor Blau
2025-12-06 20:31 ` [PATCH 07/17] midx-write.c: don't use `pack_perm` when assigning `bitmap_pos` Taylor Blau
2025-12-08 18:26   ` Patrick Steinhardt
2025-12-09  1:59     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 08/17] midx-write.c: introduce `struct write_midx_opts` Taylor Blau
2025-12-08 18:26   ` Patrick Steinhardt
2025-12-09  2:04     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 09/17] midx: do not require packs to be sorted in lexicographic order Taylor Blau
2025-12-08 18:26   ` Patrick Steinhardt
2025-12-09  2:07     ` Taylor Blau
2025-12-09  2:11       ` Taylor Blau
2025-12-06 20:31 ` [PATCH 10/17] git-compat-util.h: introduce `u32_add()` Taylor Blau
2025-12-08 18:27   ` Patrick Steinhardt
2025-12-09  2:13     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 11/17] midx-write.c: introduce `midx_pack_perm()` helper Taylor Blau
2025-12-06 20:31 ` [PATCH 12/17] midx-write.c: extract `fill_pack_from_midx()` Taylor Blau
2025-12-06 20:31 ` [PATCH 13/17] midx-write.c: enumerate `pack_int_id` values directly Taylor Blau
2025-12-08 18:27   ` Patrick Steinhardt
2025-12-09  2:14     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 14/17] midx-write.c: factor fanout layering from `compute_sorted_entries()` Taylor Blau
2025-12-06 20:31 ` [PATCH 15/17] t/helper/test-read-midx.c: plug memory leak when selecting layer Taylor Blau
2025-12-08 18:27   ` Patrick Steinhardt
2025-12-09  2:16     ` Taylor Blau
2025-12-06 20:31 ` [PATCH 16/17] midx: implement MIDX compaction Taylor Blau
2025-12-09  7:21   ` Patrick Steinhardt [this message]
2025-12-06 20:31 ` [PATCH 17/17] midx: enable reachability bitmaps during " Taylor Blau
2025-12-09  7:21   ` Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aTfN_PycU9ag8c0u@pks.im \
    --to=ps@pks.im \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).