From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net,
ps@pks.im, me@ttaylorr.com, johncai86@gmail.com,
newren@gmail.com, christian.couder@gmail.com,
kristofferhaugsbakk@fastmail.com,
Derrick Stolee <stolee@gmail.com>
Subject: [PATCH v2 12/17] repack: add --path-walk option
Date: Sun, 20 Oct 2024 13:43:25 +0000
Message-ID: <834c9ea270932e3d25e2018cca4e74831345f592.1729431810.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.1813.v2.git.1729431810.gitgitgadget@gmail.com>
From: Derrick Stolee <stolee@gmail.com>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.
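A minimal sketch of such a comparison, outside the perf suite, could look
like the following. The helper name compare_repack is hypothetical, and
the sketch assumes a non-bare repository, a working directory at its top
level, and a Git build that includes this patch:

```shell
# Hypothetical helper (not part of this patch): run a full repack with
# each strategy and print the total size in bytes of the resulting
# packfile(s).
compare_repack () {
	for extra in "" "--path-walk"
	do
		# -a: pack everything, -d: drop redundant packs,
		# -f: recompute deltas, -q: be quiet.
		# $extra is deliberately unquoted so the empty value
		# expands to no argument at all.
		git repack -adfq $extra &&
		printf '%s\t%s\n' "${extra:-default}" \
			"$(cat .git/objects/pack/pack-*.pack | wc -c)"
	done
}
```

Running compare_repack prints one line per strategy, which mirrors what
the new p5313 tests measure with test_size.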
In my copy of the Git repository, the new tests in p5313 show these
results:
Test                                      this tree
-------------------------------------------------------------
5313.10: repack                           27.88(150.23+2.70)
5313.11: repack size                      228.2M
5313.12: repack with --path-walk          134.59(148.77+0.81)
5313.13: repack size with --path-walk     209.7M
Note that the 'git pack-objects --path-walk' feature is not yet
integrated with threads, which is why the elapsed time above is much
higher even though the CPU time is about the same. A future change will
introduce threading to improve the time performance of this feature
with equivalent space performance.
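As a rough illustration of the threading gap (not part of the patch):
assuming the timings above use the usual t/perf layout of
elapsed(user+sys), dividing CPU time by elapsed time estimates how many
cores each run kept busy. The helper name cpu_per_wall is hypothetical.

```shell
# Hypothetical helper: print (user + sys) / elapsed for a timing string
# of the assumed form "elapsed(user+sys)".
cpu_per_wall () {
	printf '%s\n' "$1" |
	# Split on "(", "+" and ")": $1=elapsed, $2=user, $3=sys.
	awk -F'[(+)]' '{ printf "%.2f\n", ($2 + $3) / $1 }'
}

cpu_per_wall "27.88(150.23+2.70)"    # default repack, prints 5.49
cpu_per_wall "134.59(148.77+0.81)"   # with --path-walk, prints 1.11
```

A ratio near 1 is what one would expect from a single-threaded run.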
The microsoft/fluentui repo [1] had some interesting aspects for the
previous tests in p5313, so here are the repack results:
Test                                      this tree
-------------------------------------------------------------
5313.10: repack                           91.76(680.94+2.48)
5313.11: repack size                      439.1M
5313.12: repack with --path-walk          110.35(130.46+0.74)
5313.13: repack size with --path-walk     155.3M
[1] https://github.com/microsoft/fluentui
Here, we see a significant improvement from a full repack using this
strategy: the pack shrinks from 439.1M to 155.3M. The name-hash
collisions in this repo cause the space problems. Those collisions also
cause the default repack to spend a lot of cycles trying to find delta
bases among files that are not actually very similar, so the lack of
threading with the --path-walk feature is less pronounced in the
process time.
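To put the two size results side by side, the relative reduction can be
computed from the reported pack sizes. This is only a worked-arithmetic
sketch; the helper name size_reduction is hypothetical and the inputs are
the sizes in megabytes copied from the tables above.

```shell
# Hypothetical helper: percentage by which a pack shrank, given its size
# before and after (both in the same unit).
size_reduction () {
	awk -v before="$1" -v after="$2" \
		'BEGIN { printf "%.1f%%\n", 100 * (before - after) / before }'
}

size_reduction 439.1 155.3   # microsoft/fluentui, prints 64.6%
size_reduction 228.2 209.7   # Git repository, prints 8.1%
```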
For the Linux kernel repository, we have these stats:
Test                                      this tree
---------------------------------------------------------------
5313.10: repack                           553.61(1929.41+30.31)
5313.11: repack size                      2.5G
5313.12: repack with --path-walk          1777.63(2044.16+7.47)
5313.13: repack size with --path-walk     2.5G
Here the pack size is unchanged while the --path-walk run takes roughly
three times as long. This demonstrates that the --path-walk feature does
not always present measurable improvements, especially in cases where
the name-hash has very few collisions.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/git-repack.txt | 17 ++++++++++++++++-
builtin/repack.c | 9 ++++++++-
t/perf/p5313-pack-objects.sh | 18 ++++++++++++++++++
3 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index c902512a9e8..4ec59cd27b1 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -9,7 +9,9 @@ git-repack - Pack unpacked objects in a repository
SYNOPSIS
--------
[verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
+ [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
+ [--write-midx] [--path-walk]
DESCRIPTION
-----------
@@ -249,6 +251,19 @@ linkgit:git-multi-pack-index[1]).
Write a multi-pack index (see linkgit:git-multi-pack-index[1])
containing the non-redundant packs.
+--path-walk::
+	This option passes the `--path-walk` option to the underlying
+	`git pack-objects` process (see linkgit:git-pack-objects[1]).
+	By default, `git pack-objects` walks objects in an order that
+	presents trees and blobs in an order unrelated to the path at
+	which they appear relative to a commit's root tree. The
+	`--path-walk` option enables a different walking algorithm that
+	organizes trees and blobs by path. This has the potential to
+	improve delta compression, especially in the presence of
+	filenames that cause collisions in Git's default name-hash
+	algorithm. Because it changes how the objects are walked, this
+	option is not compatible with `--delta-islands` or `--filter`.
+
CONFIGURATION
-------------
diff --git a/builtin/repack.c b/builtin/repack.c
index cb4420f0856..af3f218ced7 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -39,7 +39,9 @@ static int run_update_server_info = 1;
static char *packdir, *packtmp_name, *packtmp;
static const char *const git_repack_usage[] = {
-	N_("git repack [<options>]"),
+	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
+	   "[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
+	   "[--write-midx] [--path-walk]"),
 	NULL
};
@@ -58,6 +60,7 @@ struct pack_objects_args {
 	int no_reuse_object;
 	int quiet;
 	int local;
+	int path_walk;
 	struct list_objects_filter_options filter_options;
};
@@ -289,6 +292,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
 		strvec_pushf(&cmd->args, "--no-reuse-object");
+	if (args->path_walk)
+		strvec_pushf(&cmd->args, "--path-walk");
 	if (args->local)
 		strvec_push(&cmd->args, "--local");
 	if (args->quiet)
@@ -1182,6 +1187,8 @@ int cmd_repack(int argc,
 				N_("pass --no-reuse-delta to git-pack-objects")),
 		OPT_BOOL('F', NULL, &po_args.no_reuse_object,
 				N_("pass --no-reuse-object to git-pack-objects")),
+		OPT_BOOL(0, "path-walk", &po_args.path_walk,
+				N_("pass --path-walk to git-pack-objects")),
 		OPT_NEGBIT('n', NULL, &run_update_server_info,
 				N_("do not run git-update-server-info"), 1),
 		OPT__QUIET(&po_args.quiet, N_("be quiet")),
diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
index 840075f5691..b588066ddb0 100755
--- a/t/perf/p5313-pack-objects.sh
+++ b/t/perf/p5313-pack-objects.sh
@@ -56,4 +56,22 @@ test_size 'big pack size with --path-walk' '
 	test_file_size out
 '
+test_perf 'repack' '
+	git repack -adf
+'
+
+test_size 'repack size' '
+	pack=$(ls .git/objects/pack/pack-*.pack) &&
+	test_file_size "$pack"
+'
+
+test_perf 'repack with --path-walk' '
+	git repack -adf --path-walk
+'
+
+test_size 'repack size with --path-walk' '
+	pack=$(ls .git/objects/pack/pack-*.pack) &&
+	test_file_size "$pack"
+'
+
test_done
--
gitgitgadget