git.vger.kernel.org archive mirror
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, me@ttayllorr.com, Derrick Stolee <stolee@gmail.com>
Subject: [PATCH 0/2] Fix background maintenance regression in Git 2.45.0
Date: Thu, 18 Jul 2024 19:55:44 +0000
Message-ID: <pull.1764.git.1721332546.gitgitgadget@gmail.com>

Here is a problem I noticed while investigating my local copy of a
large monorepo. I was intending to show some engineers how nicely the objects
were maintained by background maintenance, but saw hundreds of small
pack-files that were up to two months old. That age matched when I upgraded
to the microsoft/git fork that included the 2.45.0 release of Git.

The issue is that 'git multi-pack-index repack' was taught to call 'git
pack-objects' with the new '--stdin-packs' option. However, this changes the
object selection algorithm. Instead of using the objects referenced by the
multi-pack-index, it compares pack-files using a list of "included" and
"excluded" pack-files. This loses some granularity in how the
multi-pack-index chooses among duplicate objects.
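
As a toy illustration of the two selection strategies described above (a
sketch, not Git's implementation), the model below treats each pack-file as
a set of object IDs and the multi-pack-index as a map from object ID to the
one pack it references. The pack names, object IDs, and helper functions are
all hypothetical, and the '--stdin-packs' behavior is simplified to "objects
in included packs minus objects in excluded packs".

```python
# Toy model: pack name -> set of object IDs (all names are hypothetical).
packs = {
    "pack-A": {"obj1", "obj2", "obj3"},  # small pack chosen for the repack batch
    "pack-B": {"obj3", "obj4"},          # pack left out of the batch
}

# Assumption: the multi-pack-index references exactly one pack per object.
# Here it happens to reference the pack-A copy of the duplicate obj3.
midx = {"obj1": "pack-A", "obj2": "pack-A", "obj3": "pack-A", "obj4": "pack-B"}


def select_by_midx(midx, batch):
    """Pick every object whose midx-referenced copy lives in a batch pack."""
    return {oid for oid, pack in midx.items() if pack in batch}


def select_by_stdin_packs(packs, included, excluded):
    """Simplified pack-name selection: included packs minus excluded packs."""
    selected = set().union(*(packs[name] for name in included))
    for name in excluded:
        selected -= packs[name]
    return selected


by_midx = select_by_midx(midx, batch={"pack-A"})
by_packs = select_by_stdin_packs(packs, included={"pack-A"}, excluded={"pack-B"})

print(sorted(by_midx))             # obj3 is selected
print(sorted(by_packs))            # obj3 is dropped: it also exists in pack-B
print(sorted(by_midx - by_packs))  # the objects lost by pack-name selection
```

In this model the duplicate object is the only difference between the two
selections, which is the granularity that comparing pack names cannot express.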

The end result is that some objects that would normally have been included
in the new pack-file are no longer included. The copy that the
multi-pack-index references is in the pack-file that was intended to be
repacked, so that pack-file cannot be expired in the next 'git
multi-pack-index expire' step and is included again in the batch of objects
to repack.
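
The consequence for the expire step can be sketched with the same kind of
toy model. This is an assumption-laden simplification: the helper
`repack_then_expire`, the pack names, and the rule "the rewritten
multi-pack-index points every repacked object at the new pack" are
hypothetical stand-ins, and expiry is modeled as "delete any pack that the
multi-pack-index no longer references".

```python
def repack_then_expire(packs, midx, selected, new_pack="pack-new"):
    """Model one repack followed by one expire; return the expired pack names.

    `selected` is the set of objects written to the new pack. Inputs are
    copied so the caller's data is left untouched.
    """
    packs = {name: set(objs) for name, objs in packs.items()}
    midx = dict(midx)

    # Repack: write the selected objects to a new pack and (by assumption)
    # have the multi-pack-index reference the new copies.
    packs[new_pack] = set(selected)
    for oid in selected:
        midx[oid] = new_pack

    # Expire: a pack can go only if the midx references none of its objects.
    referenced = set(midx.values())
    return {name for name in packs if name not in referenced}


# Hypothetical repository state: obj3 is duplicated across both packs and
# the multi-pack-index references the copy in pack-A.
packs = {"pack-A": {"obj1", "obj2", "obj3"}, "pack-B": {"obj3", "obj4"}}
midx = {"obj1": "pack-A", "obj2": "pack-A", "obj3": "pack-A", "obj4": "pack-B"}

# Selection through the multi-pack-index repacks all of pack-A's referenced
# objects, so pack-A becomes unreferenced and can be expired.
expired_midx = repack_then_expire(packs, midx, {"obj1", "obj2", "obj3"})

# Selection by pack names skips obj3, so the midx still references pack-A
# and the pack survives to be batched again on the next run.
expired_stdin = repack_then_expire(packs, midx, {"obj1", "obj2"})

print(expired_midx)
print(expired_stdin)
```

Repeating the second case on every maintenance run leaves the small pack in
place each time, which is consistent with the accumulation of old pack-files
described at the top of this message.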

In the context of the change that is reverted by this series, it seems the
motivation of the change was two-fold:

 1. some I/O benefits to using pack names over object names, and
 2. the ability to use an object walk to improve delta compression.

In my local prototyping, I've found that we could improve 'git pack-objects'
to use an object walk when given a set of objects over stdin without needing
to use pack-file names. I do not believe the '--stdin-packs' option should
be used for the 'git multi-pack-index repack' mechanism, or at least it should
be used with great care and only in specific cases where some assumptions
can be made around duplicate objects and closure under reachability.

However, the prototype I've built to get these benefits is non-trivial
because it must guarantee that partial clones do not accidentally download
missing blobs. That will follow in a separate series that can be reviewed at
a slower pace.

Thanks, -Stolee

Derrick Stolee (2):
  t5319: add failing test case for repack/expire
  midx-write: revert use of --stdin-packs

 midx-write.c                | 18 ++++++------
 t/t5319-multi-pack-index.sh | 55 +++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+), 9 deletions(-)


base-commit: bea9ecd24b0c3bf06cab4a851694fe09e7e51408
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1764%2Fderrickstolee%2Fincremental-repack-fix-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1764/derrickstolee/incremental-repack-fix-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1764
-- 
gitgitgadget

Thread overview: 12+ messages
2024-07-18 19:55 Derrick Stolee via GitGitGadget [this message]
2024-07-18 19:55 ` [PATCH 1/2] t5319: add failing test case for repack/expire Derrick Stolee via GitGitGadget
2024-07-18 19:55 ` [PATCH 2/2] midx-write: revert use of --stdin-packs Derrick Stolee via GitGitGadget
2024-07-18 21:57 ` [PATCH 0/2] Fix background maintenance regression in Git 2.45.0 Junio C Hamano
2024-07-18 22:38   ` Taylor Blau
2024-07-19 13:23     ` Derrick Stolee
2024-07-19 13:24   ` Derrick Stolee
2024-07-19 15:13     ` Junio C Hamano
2024-07-19 16:20       ` Derrick Stolee
2024-07-18 22:50 ` Taylor Blau
2024-07-19 13:21   ` Derrick Stolee
2024-07-19 13:38     ` Taylor Blau
