From: Taylor Blau <me@ttaylorr.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: git@vger.kernel.org, Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH 3/8] builtin/maintenance: introduce "geometric-repack" task
Date: Fri, 17 Oct 2025 18:28:04 -0400 [thread overview]
Message-ID: <aPLC9FIDR8Y3ayn+@nand.local> (raw)
In-Reply-To: <20251016-pks-maintenance-geometric-strategy-v1-3-18943d474203@pks.im>
On Thu, Oct 16, 2025 at 09:26:34AM +0200, Patrick Steinhardt wrote:
> Introduce a new "geometric-repack" task. This task uses our geometric
> repack infrastructure as provided by git-repack(1) itself, which is ab
> strategy that especially hosting providers tend to use to amortize the
> costs of repacking objects.
>
> There is one issue though with geometric repacks, namely that they
> unconditionally pack all loose objects, regardless of whether or not
> they are reachable. This is done because it means that we can completely
> skip the reachability step, which significantly speeds up the operation.
> But it has the big downside that we are unable to expire objects over
> time.
>
> To address this issue we thus use a split strategy in this new task:
> whenever a geometric repack would merge together all packs, we instead
> do an all-into-one repack. By default, these all-into-one repacks have
> cruft packs enabled, so unreachable objects would now be written into
> their own pack. Consequently, they won't be soaked up during geometric
> repacking anymore and can be expired with the next full repack, assuming
> that their expiry date has surpassed.
Well put. Geometric repacking today is really only what objects appear
in the packfiles, not whether those objects are reachable or not. That's
partially by design: geometric repack operations are meant to be as
lightweight and quick as possible, so performing a potentially expensive
reachability traversal defeats the purpose.
This mirrors what GitHub does today with their own repository
maintenance implementation. There is some number of geometric repack
operations interspersed between full repacks which collapse the
geometric progression and move unreachable objects out into cruft packs.
So I think that what you did here makes a ton of sense to me. Ultimately
I think there is a middle ground for geometric repacking (well outside
of the scope for this series ;-), don't worry) that could make it do a
little bit of both.
If 'git pack-objects --stdin-packs' (what ultimately implements the
portion of geometric repacking that combines packs together) knew the
current state of a repository's references, it could mark the objects in
the packs to be combined as either reachable or unreachable. Then in a
specialized mode, you could exclude any objects which are unreachable
from the resulting pack, and take a separate pass to write out a cruft
pack containing those objects before ultimately deleting the combined
packs.
I think that is all possible to do, and I think there is a way we could
do it relatively quickly without harming the performance of geometric
repacking. When traversing and marking objects, we can stop as soon as
we see an object that is not contained in the packs that that we're
combining.
So I don't think we have to do a whole-repository walk, which would
indeed defeat the purpose of geometric repacking. The above procedure
would cause us to write out small cruft packs, but we could use the
--combine-cruft-below-size option of 'git repack' to prevent too many
small cruft packs from accumulating together.
Anyway, nothing of that has anything to do with what you wrote here ;-).
It was mostly an excuse for me to write down some of these thoughts that
I've had in my head and alluded to briefly a couple of weeks ago at Git
Merge. Expect some actual patches in this direction from me in the not
too distant future :-).
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Documentation/config/maintenance.adoc | 11 +++
> builtin/gc.c | 102 +++++++++++++++++++++++++
> t/t7900-maintenance.sh | 137 ++++++++++++++++++++++++++++++++++
> 3 files changed, 250 insertions(+)
>
> diff --git a/Documentation/config/maintenance.adoc b/Documentation/config/maintenance.adoc
> index 2f719342183..26dc5de423f 100644
> --- a/Documentation/config/maintenance.adoc
> +++ b/Documentation/config/maintenance.adoc
> @@ -75,6 +75,17 @@ maintenance.incremental-repack.auto::
> number of pack-files not in the multi-pack-index is at least the value
> of `maintenance.incremental-repack.auto`. The default value is 10.
>
> +maintenance.geometric-repack.auto::
> + This integer config option controls how often the `geometric-repack`
> + task should be run as part of `git maintenance run --auto`. If zero,
> + then the `geometric-repack` task will not run with the `--auto`
> + option. A negative value will force the task to run every time.
> + Otherwise, a positive value implies the command should run either when
> + there are packfiles that need to be merged together to retain the
> + geometric progression, or when there are at least this many loose
> + objects that would be written into a new packfile. The default value is
> + 100.
> +
OK. To make sure I understand: this limit is the minimum number of loose
objects would cause the geometric-repack task to run, unless there are
pack(s) which would be combined as a result of running a geometric
repack, in which case we run it regardless.
Is that right?
> maintenance.reflog-expire.auto::
> This integer config option controls how often the `reflog-expire` task
> should be run as part of `git maintenance run --auto`. If zero, then
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 026d3a1d714..2c9ecd464d2 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -34,6 +34,7 @@
> #include "pack-objects.h"
> #include "path.h"
> #include "reflog.h"
> +#include "repack.h"
Hey, neat ;-).
> @@ -1566,6 +1568,101 @@ static int maintenance_task_incremental_repack(struct maintenance_run_opts *opts
> return 0;
> }
>
> +static int maintenance_task_geometric_repack(struct maintenance_run_opts *opts,
> + struct gc_config *cfg)
> +{
> + struct pack_geometry geometry = {
> + .split_factor = 2,
I wonder if this should be configurable somewhere. It might not be a bad
idea to introduce a 'repack.geometricSplitFactor' configuration
variable, defaulting to two, and use that here. It would also be nice to
be able to run 'git repack --geometric -d' and have it fallback to that
split factor, since using "2" is so common that it's frustrating when I
forget to type it out explicitly ;-).
> + };
> + struct pack_objects_args po_args = {
> + .local = 1,
> + };
> + struct existing_packs existing_packs = EXISTING_PACKS_INIT;
> + struct string_list kept_packs = STRING_LIST_INIT_DUP;
> + struct child_process child = CHILD_PROCESS_INIT;
> + int ret;
> +
> + existing_packs.repo = the_repository;
> + existing_packs_collect(&existing_packs, &kept_packs);
> + pack_geometry_init(&geometry, &existing_packs, &po_args);
> + pack_geometry_split(&geometry);
> +
> + child.git_cmd = 1;
> +
> + strvec_pushl(&child.args, "repack", "-d", "-l", NULL);
> + if (geometry.split < geometry.pack_nr)
> + strvec_push(&child.args, "--geometric=2");
> + else
> + add_repack_all_option(cfg, NULL, &child.args);
Makes sense; if we're not merging any packs, we do an all-into-one
repack, otherwise we do a geometric one. Looks like the function
geometric_repack_auto_condition() below controls whether or not we even
take this path, which makes sense relative to the documentation you
wrote above.
> +static int geometric_repack_auto_condition(struct gc_config *cfg UNUSED)
> +{
> + struct pack_geometry geometry = {
> + .split_factor = 2,
> + };
> + struct pack_objects_args po_args = {
> + .local = 1,
> + };
> + struct existing_packs existing_packs = EXISTING_PACKS_INIT;
> + struct string_list kept_packs = STRING_LIST_INIT_DUP;
> + int auto_value = 100;
> + int ret;
> +
> + repo_config_get_int(the_repository, "maintenance.geometric-repack.auto",
> + &auto_value);
> + if (!auto_value)
> + return 0;
> + if (auto_value < 0)
> + return 1;
> +
> + existing_packs.repo = the_repository;
> + existing_packs_collect(&existing_packs, &kept_packs);
> + pack_geometry_init(&geometry, &existing_packs, &po_args);
> + pack_geometry_split(&geometry);
> +
> + /*
> + * When we'd merge at least two packs with one another we always
> + * perform the repack.
> + */
> + if (geometry.split) {
> + ret = 1;
> + goto out;
> + }
Hmm. I wish that we could somehow pass this information to the function
above so that we don't have to re-discover the fact that there are packs
to combine. I'm not familiar enough with the maintenance code to know
how difficult that would be to do, but it looks like at least the
gc_config pointer is shared between the auto condition and the task
itself.
That's kind of gross to tack on there, but I could see a compelling
argument for passing around an extra void pointer between the two that
would allow us to propagate this kind of data between the auto condition
and the task itself. It's not super expensive to do so I don't think not
doing it is a show-stopper at least from a performance perspective, but
it does seem like a good opportunity to DRY things up a bit.
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index ddd273d8dc2..83a373fe94b 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -465,6 +465,143 @@ test_expect_success 'maintenance.incremental-repack.auto (when config is unset)'
> )
> '
>
> +run_and_verify_geometric_pack () {
> + EXPECTED_PACKS="$1" &&
> +
> + # Verify that we perform a geometric repack.
> + rm -f "trace2.txt" &&
> + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
> + git maintenance run --task=geometric-repack 2>/dev/null &&
> + test_subcommand git repack -d -l --geometric=2 --quiet --write-midx <trace2.txt &&
Makes sense. I do think the test_subcommand thing is a little fragile
here, but verifying that the resulting pack structure forms a geometric
progression feels like overkill for this test, so I think what you wrote
here makes sense.
As an aside, would you mind wrapping these lines instead of putting the
command-line invocation all together on a single line?
> +
> + # Verify that the number of packfiles matches our expectation.
> + ls -l .git/objects/pack/*.pack >packfiles &&
> + test_line_count = "$EXPECTED_PACKS" packfiles &&
> +
> + # And verify that there are no loose objects anymore.
> + cat >expect <<-\EOF &&
> + info
> + pack
> + EOF
> + ls .git/objects >actual &&
I wonder if there is an easier way to check for loose objects here that
doesn't require you to know that the "info" and "pack" directories
exist. Perhaps something like:
test_stdout_line_count = 0 find .git/objects/?? -type f
, or even
find .git/objects/?? -type f >loose.objs &&
test_must_be_empty loose.objs
> +test_expect_success 'geometric repacking task' '
> + test_when_finished "rm -rf repo" &&
> + git init repo &&
> + (
> + cd repo &&
> + git config set maintenance.auto false &&
> + test_commit initial &&
> +
> + # The initial repack causes an all-into-one repack.
> + GIT_TRACE2_EVENT="$(pwd)/initial-repack.txt" \
> + git maintenance run --task=geometric-repack 2>/dev/null &&
> + test_subcommand git repack -d -l --cruft --cruft-expiration=2.weeks.ago --quiet --write-midx <initial-repack.txt &&
> +
> + # Repacking should now cause a no-op geometric repack because
> + # no packfiles need to be combined.
> + ls -l .git/objects/pack >before &&
> + run_and_verify_geometric_pack 1 &&
> + ls -l .git/objects/pack >after &&
> + test_cmp before after &&
> +
> + # This incremental change creates a new packfile that only
> + # soaks up loose objects. The packfiles are not getting merged
> + # at this point.
> + test_commit loose &&
> + run_and_verify_geometric_pack 2 &&
I wonder if you want to harden this test a little bit to ensure that the
there is only one new pack being created here, and we're not seeing
e.g., the removal of the existing pack and creation of two new packs.
I dunno, that may be overkill for this test, and I certainly don't feel
strongly about it.
> +
> + # Both packfiles have 3 objects, so the next run would cause us
> + # to merge both packfiles together. This should be turned into
Perhaps s/both/all/ ? What you wrote is not wrong, of course, but I
think "all" more clearly communicates that we are only doing an
all-into-one because the geometric repack would have combined everything
together anyway.
The rest of the changes look good to me.
Thanks,
Taylor
next prev parent reply other threads:[~2025-10-17 22:28 UTC|newest]
Thread overview: 69+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-16 7:26 [PATCH 0/8] builtin/maintenance: introduce "geometric" strategy Patrick Steinhardt
2025-10-16 7:26 ` [PATCH 1/8] builtin/gc: remove global `repack` variable Patrick Steinhardt
2025-10-16 20:07 ` Justin Tobler
2025-10-17 20:58 ` Taylor Blau
2025-10-16 7:26 ` [PATCH 2/8] builtin/gc: make `too_many_loose_objects()` reusable without GC config Patrick Steinhardt
2025-10-16 20:59 ` Junio C Hamano
2025-10-16 7:26 ` [PATCH 3/8] builtin/maintenance: introduce "geometric-repack" task Patrick Steinhardt
2025-10-16 20:51 ` Justin Tobler
2025-10-17 6:13 ` Patrick Steinhardt
2025-10-17 22:28 ` Taylor Blau [this message]
2025-10-21 13:00 ` Patrick Steinhardt
2025-10-23 19:19 ` Taylor Blau
2025-10-24 5:44 ` Patrick Steinhardt
2025-10-16 7:26 ` [PATCH 4/8] builtin/maintenance: don't silently ignore invalid strategy Patrick Steinhardt
2025-10-16 7:26 ` [PATCH 5/8] builtin/maintenance: run maintenance tasks depending on type Patrick Steinhardt
2025-10-16 7:26 ` [PATCH 6/8] builtin/maintenance: extend "maintenance.strategy" to manual maintenance Patrick Steinhardt
2025-10-16 7:26 ` [PATCH 7/8] builtin/maintenance: make "gc" strategy accessible Patrick Steinhardt
2025-10-16 7:26 ` [PATCH 8/8] builtin/maintenance: introduce "geometric" strategy Patrick Steinhardt
2025-10-21 14:13 ` [PATCH v2 0/9] " Patrick Steinhardt
2025-10-21 14:13 ` [PATCH v2 1/9] builtin/gc: remove global `repack` variable Patrick Steinhardt
2025-10-21 14:13 ` [PATCH v2 2/9] builtin/gc: make `too_many_loose_objects()` reusable without GC config Patrick Steinhardt
2025-10-21 14:13 ` [PATCH v2 3/9] builtin/maintenance: introduce "geometric-repack" task Patrick Steinhardt
2025-10-23 19:29 ` Taylor Blau
2025-10-24 5:45 ` Patrick Steinhardt
2025-10-21 14:13 ` [PATCH v2 4/9] builtin/maintenance: make the geometric factor configurable Patrick Steinhardt
2025-10-23 19:33 ` Taylor Blau
2025-10-24 5:45 ` Patrick Steinhardt
2025-10-24 19:02 ` Taylor Blau
2025-10-21 14:13 ` [PATCH v2 5/9] builtin/maintenance: don't silently ignore invalid strategy Patrick Steinhardt
2025-10-23 21:31 ` Taylor Blau
2025-10-21 14:13 ` [PATCH v2 6/9] builtin/maintenance: run maintenance tasks depending on type Patrick Steinhardt
2025-10-23 21:34 ` Taylor Blau
2025-10-21 14:13 ` [PATCH v2 7/9] builtin/maintenance: extend "maintenance.strategy" to manual maintenance Patrick Steinhardt
2025-10-21 14:13 ` [PATCH v2 8/9] builtin/maintenance: make "gc" strategy accessible Patrick Steinhardt
2025-10-21 14:13 ` [PATCH v2 9/9] builtin/maintenance: introduce "geometric" strategy Patrick Steinhardt
2025-10-23 21:49 ` Taylor Blau
2025-10-24 5:45 ` Patrick Steinhardt
2025-10-23 16:48 ` [PATCH v2 0/9] " Junio C Hamano
2025-10-23 21:50 ` Taylor Blau
2025-10-24 6:57 ` [PATCH v3 00/10] " Patrick Steinhardt
2025-10-24 6:57 ` [PATCH v3 01/10] builtin/gc: remove global `repack` variable Patrick Steinhardt
2025-10-24 6:57 ` [PATCH v3 02/10] builtin/gc: make `too_many_loose_objects()` reusable without GC config Patrick Steinhardt
2025-10-24 6:57 ` [PATCH v3 03/10] builtin/maintenance: introduce "geometric-repack" task Patrick Steinhardt
2025-10-25 19:15 ` Jeff King
2025-10-27 8:24 ` Patrick Steinhardt
2025-10-27 14:25 ` Jeff King
2025-10-24 6:57 ` [PATCH v3 04/10] builtin/maintenance: make the geometric factor configurable Patrick Steinhardt
2025-10-24 6:57 ` [PATCH v3 05/10] builtin/maintenance: don't silently ignore invalid strategy Patrick Steinhardt
2025-10-24 6:57 ` [PATCH v3 06/10] builtin/maintenance: improve readability of strategies Patrick Steinhardt
2025-10-24 6:57 ` [PATCH v3 07/10] builtin/maintenance: run maintenance tasks depending on type Patrick Steinhardt
2025-10-24 6:57 ` [PATCH v3 08/10] builtin/maintenance: extend "maintenance.strategy" to manual maintenance Patrick Steinhardt
2025-10-24 6:57 ` [PATCH v3 09/10] builtin/maintenance: make "gc" strategy accessible Patrick Steinhardt
2025-10-24 6:57 ` [PATCH v3 10/10] builtin/maintenance: introduce "geometric" strategy Patrick Steinhardt
2025-10-24 19:03 ` [PATCH v3 00/10] " Taylor Blau
2025-10-24 19:11 ` Junio C Hamano
2025-10-27 8:30 ` [PATCH v4 " Patrick Steinhardt
2025-10-27 8:30 ` [PATCH v4 01/10] builtin/gc: remove global `repack` variable Patrick Steinhardt
2025-10-27 8:30 ` [PATCH v4 02/10] builtin/gc: make `too_many_loose_objects()` reusable without GC config Patrick Steinhardt
2025-10-27 8:30 ` [PATCH v4 03/10] builtin/maintenance: introduce "geometric-repack" task Patrick Steinhardt
2025-10-27 8:30 ` [PATCH v4 04/10] builtin/maintenance: make the geometric factor configurable Patrick Steinhardt
2025-10-27 8:30 ` [PATCH v4 05/10] builtin/maintenance: don't silently ignore invalid strategy Patrick Steinhardt
2025-10-27 8:30 ` [PATCH v4 06/10] builtin/maintenance: improve readability of strategies Patrick Steinhardt
2025-10-27 8:30 ` [PATCH v4 07/10] builtin/maintenance: run maintenance tasks depending on type Patrick Steinhardt
2025-10-27 8:30 ` [PATCH v4 08/10] builtin/maintenance: extend "maintenance.strategy" to manual maintenance Patrick Steinhardt
2025-10-27 8:30 ` [PATCH v4 09/10] builtin/maintenance: make "gc" strategy accessible Patrick Steinhardt
2025-10-27 8:31 ` [PATCH v4 10/10] builtin/maintenance: introduce "geometric" strategy Patrick Steinhardt
2025-10-27 15:53 ` [PATCH v4 00/10] " Junio C Hamano
2025-10-27 20:05 ` Patrick Steinhardt
2025-10-27 20:58 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aPLC9FIDR8Y3ayn+@nand.local \
--to=me@ttaylorr.com \
--cc=git@vger.kernel.org \
--cc=ps@pks.im \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).