From: Taylor Blau <me@ttaylorr.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, sluongng@gmail.com,
Derrick Stolee <derrickstolee@github.com>,
Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH] multi-pack-index: repack batches below --batch-size
Date: Tue, 11 Aug 2020 11:50:13 -0400 [thread overview]
Message-ID: <20200811155013.GF19871@syl.lan> (raw)
In-Reply-To: <pull.698.git.1597159818457.gitgitgadget@gmail.com>
On Tue, Aug 11, 2020 at 03:30:18PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The --batch-size=<size> option of 'git multi-pack-index repack' is
> intended to limit the amount of work done by the repack. In the case of
> a large repository, this command should repack a number of small
> pack-files but leave the large pack-files alone. Most often, the
> repository has one large pack-file from a 'git clone' operation and
> number of smaller pack-files from incremental 'git fetch' operations.
>
> The issue with '--batch-size' is that it also _prevents_ the repack from
> happening if the expected size of the resulting pack-file is too small.
> This was intended as a way to avoid frequent churn of small pack-files,
> but it has mostly caused confusion when a repository is of "medium"
> size. That is, not enormous like the Windows OS repository, but also not
> so small that this incremental repack isn't valuable.
>
> The solution presented here is to collect pack-files for repack if their
> expected size is smaller than the batch-size parameter until either the
> total expected size exceeds the batch-size or all pack-files are
> considered. If there are at least two pack-files, then these are
> combined to a new pack-file whose size should not be too much larger
> than the batch-size.
>
> This new strategy should succeed in keeping the number of pack-files
> small in these "medium" size repositories. The concern about churn is
> likely not interesting, as the real control over that is the frequency
> in which the repack command is run.
OK. This '--batch-size' parameter is a little magical to me, but the
behavior you are describing seems sane. I think that your assessment of
the down-side is reasonable, too.
Looks fine to me.
Reviewed-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> multi-pack-index: repack batches below --batch-size
>
> As reported [1], the 'git multi-pack-index repack' command has some
> unexpected behavior due to the nature of "expected size" for un-thinned
> fetch packs and the fact that the batch size requires the total size to
> be at least as large as that batch-size. By removing this minimum size
> restriction, we will repack more frequently and prevent this "many
> pack-file" problems.
>
> [1]
> https://lore.kernel.org/git/6FA8F54A-C92D-497B-895F-AC6E8287AACD@gmail.com/
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-698%2Fderrickstolee%2Fmidx-repack-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-698/derrickstolee/midx-repack-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/698
>
> Documentation/git-multi-pack-index.txt | 11 ++++++-----
> midx.c | 2 +-
> t/t5319-multi-pack-index.sh | 18 ++++++++++++++++++
> 3 files changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index 0c6619493c..eb0caa0439 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -51,11 +51,12 @@ repack::
> multi-pack-index, then divide by the total number of objects in
> the pack and multiply by the pack size. We select packs with
> expected size below the batch size until the set of packs have
> - total expected size at least the batch size. If the total size
> - does not reach the batch size, then do nothing. If a new pack-
> - file is created, rewrite the multi-pack-index to reference the
> - new pack-file. A later run of 'git multi-pack-index expire' will
> - delete the pack-files that were part of this batch.
> + total expected size at least the batch size, or all pack-files
> + are considered. If only one pack-file is selected, then do
> + nothing. If a new pack-file is created, rewrite the
> + multi-pack-index to reference the new pack-file. A later run of
> + 'git multi-pack-index expire' will delete the pack-files that
> + were part of this batch.
> +
> If `repack.packKeptObjects` is `false`, then any pack-files with an
> associated `.keep` file will not be selected for the batch to repack.
> diff --git a/midx.c b/midx.c
> index 6d1584ca51..38690b46c9 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -1371,7 +1371,7 @@ static int fill_included_packs_batch(struct repository *r,
>
> free(pack_info);
>
> - if (total_size < batch_size || packs_to_repack < 2)
> + if (packs_to_repack < 2)
> return 1;
>
> return 0;
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index 7214cab36c..b05190f500 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -643,6 +643,7 @@ test_expect_success 'expire respects .keep files' '
> '
>
> test_expect_success 'repack --batch-size=0 repacks everything' '
> + cp -r dup dup2 &&
> (
> cd dup &&
> rm .git/objects/pack/*.keep &&
> @@ -662,4 +663,21 @@ test_expect_success 'repack --batch-size=0 repacks everything' '
> )
> '
>
> +test_expect_success 'repack --batch-size=<large> repacks everything' '
> + (
> + cd dup2 &&
> + rm .git/objects/pack/*.keep &&
> + ls .git/objects/pack/*idx >idx-list &&
> + test_line_count = 2 idx-list &&
> + git multi-pack-index repack --batch-size=2000000 &&
> + ls .git/objects/pack/*idx >idx-list &&
> + test_line_count = 3 idx-list &&
> + test-tool read-midx .git/objects | grep idx >midx-list &&
> + test_line_count = 3 midx-list &&
> + git multi-pack-index expire &&
> + ls -al .git/objects/pack/*idx >idx-list &&
> + test_line_count = 1 idx-list
> + )
> +'
> +
> test_done
>
> base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
> --
> gitgitgadget
Thanks,
Taylor
next prev parent reply other threads:[~2020-08-11 15:50 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-08-11 15:30 [PATCH] multi-pack-index: repack batches below --batch-size Derrick Stolee via GitGitGadget
2020-08-11 15:50 ` Taylor Blau [this message]
2020-08-11 21:06 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200811155013.GF19871@syl.lan \
--to=me@ttaylorr.com \
--cc=derrickstolee@github.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=sluongng@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).