From: Taylor Blau <me@ttaylorr.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
git@vger.kernel.org, christian.couder@gmail.com,
johannes.schindelin@gmx.de, johncai86@gmail.com,
jonathantanmy@google.com, karthik.188@gmail.com,
kristofferhaugsbakk@fastmail.com, newren@gmail.com,
peff@peff.net, ps@pks.im, Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects'
Date: Wed, 12 Mar 2025 16:47:37 -0400 [thread overview]
Message-ID: <Z9Hy6Yk2XM1RCsNC@nand.local> (raw)
In-Reply-To: <xmqqwmcw7q2x.fsf@gitster.g>
On Mon, Mar 10, 2025 at 10:28:22AM -0700, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > ... deltas across path boundaries. This second pass is much faster than a fresh
> > pass since the existing deltas are used as a limit for the size of
> > potentially new deltas, short-circuiting the checks when the delta size
> > exceeds the current-best.
>
> Very nice.
>
> > The microsoft/fluentui is a public Javascript repo that suffers from many of
> > the name hash collisions as internal repositories I've worked with. Here is
> > a comparison of the compressed size and end-to-end time of the repack:
> >
> > Repack Method Pack Size Time
> > ---------------------------------------
> > Hash v1 439.4M 87.24s
> > Hash v2 161.7M 21.51s
> > Path Walk 142.5M 28.16s
OK, so microsoft/fluentui benefits from the path-walk approach in the
size of the resulting pack, but at the cost of additional time to
generate it.
> > Less dramatic, but perhaps more standardly structured is the nodejs/node
> > repository, with these stats:
> >
> > Repack Method Pack Size Time
> > ------------------------------------------
> > Hash v1 739.9M 71.18s
> > Hash v2 764.6M 67.82s
> > Path Walk 698.0M 75.10s
Same here.
> > Even the Linux kernel repository gains some benefits, even though the number
> > of hash collisions is relatively low due to a preference for short
> > filenames:
> >
> > Repack Method Pack Size Time
> > ------------------------------------------
> > Hash v1 2.5G 554.41s
> > Hash v2 2.5G 549.62s
> > Path Walk 2.2G 559.00s
OK, so here the savings are a little more substantial, and the
performance hit isn't too bad.
> This third one, v2 not performing much better than v1, is quite
> surprising.
I'm not sure... I think Stolee's "the number of hash collisions is
relatively low due to preference for short filenames" is why v2 behaves
so similarly to v1 here.
> > The drawbacks of the --path-walk feature is that it will be harder to
> > integrate it with bitmap features, specifically delta islands. This is not
> > insurmountable, but would require more work, such as a revision walk to
> > paint objects with reachability information before using that during delta
> > computations.
> >
> > However, there should still be significant benefits to Git clients trying to
> > save space and improve local performance.
>
> Sure. More experiments and more approaches will eventually give us
> overall improvement. I am hoping that we will be able to condense
> the result of these different approaches and their combinations into
> easy-to-choose-from canned choices (as opposed to a myriad of little
> knobs the users need to futz with without really understanding what
> they are tweaking).
In the above three examples we see some trade-offs between pack size and
the time it took to generate it. I think it's worth discussing whether
or not the potential benefit of such a trade-off is worth the
significant complexity and code that this feature will introduce. (To be
clear, I don't have a strong opinion here one way or the other, but I do
think that it's at least worth discussing).
I wonder how much of the benefits of path-walk over the hash v2 approach
could be had by simply widening the pack.window during delta selection?
I tried to run a similar experiment as you did above on the
microsoft/fluentui repository and got the following:
Repack Method Pack Size Time
------------------------------------------
Hash v1 447.2MiB 932.41s
Hash v2 154.1MiB 404.35s
Hash v2 (window=20) 146.7MiB 472.66s
Hash v2 (window=50) 138.3MiB 622.13s
Path Walk 140.8MiB 168.86s
In your experiment above on the same repository, the path walk feature
represents an 11.873% reduction in pack size, but at the cost of a 30.9%
regression in runtime.
When I set pack.window to "50" (over the default value of "10"), I get a
~10.3% reduction in pack size at the cost of a 54% increase in runtime
(relative to just --name-hash-version=2 with the default pack.window
settings).
But when I set the pack.window to "20", the relative values (again
comparing against --name-hash-version=2 with the default pack.window)
are 4.8% reduction in pack size and a 16.9% increase in runtime.
But these numbers are pretty confusing to me, TBH. The reduction in pack
sizes makes sense, and here I see numbers that are on-par with what you
noted above for the same repository. But the runtimes are wildly
different (e.g., hash v1 takes you just 87s while mine takes 932s).
There must be something in our environment that is different. I'm
starting with a bare clone of microsoft/fluentui from GitHub, and made
several 'cp -al' copies of it for the different experiments. In the
penultimate one, I ran:
$ time git.compile -c pack.window=50 repack --name-hash-version=2 \
-adF --no-write-bitmap-index
, and similarly for the other experiments with appropriate values for
pack.window, --name-hash-version, and --path-walk, when applicable. All
of this was done on a -O2 build of Git with your patches on top.
So I'm not sure what to make of these results. Clearly on my machine
something is different that makes path-walk much faster than hash v2.
But on your machine it's slower, so I don't know how much I trust the
timing results from either machine.
In any event, it seems like at least in this example we can get
performance that is on-par with path-walk by simply widening the
pack.window when using hash v2. On my machine that seems to cost more
time than it does for you to the point where it's slower than my
path-walk. But I think I need to understand what the differences are
here before we can draw any conclusions on the size or timing.
If the overwhelming majority of cases where the --path-walk feature
presents a significant benefit over hash v2 at various pack.window sizes
(where we could get approximately the same reduction in pack size with
approximately the same end-to-end runtime of 'git repack'), then I feel
we might want to reconsider whether or not the complexity of this feature
is worthwhile.
But if the --path-walk feature either gives us a significant size
benefit that we can't get with hash v2 and a wider pack.window without
paying a significant runtime cost (or vice-versa), then this feature
would indeed be worthwhile.
I also have no idea how representative the above is of your intended
use-case, which seems much more oriented around pushes than from-scratch
repacks, which would also affect our conclusions here.
Thanks,
Taylor
next prev parent reply other threads:[~2025-03-12 20:47 UTC|newest]
Thread overview: 75+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-10 1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-03-12 21:01 ` Taylor Blau
2025-03-20 19:48 ` Derrick Stolee
2025-03-10 1:50 ` [PATCH 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-12 21:14 ` Taylor Blau
2025-03-20 19:46 ` Derrick Stolee
2025-03-10 1:50 ` [PATCH 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-12 21:14 ` Taylor Blau
2025-03-10 1:50 ` [PATCH 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-03-12 21:21 ` Taylor Blau
2025-03-20 19:57 ` Derrick Stolee
2025-03-10 1:50 ` [PATCH 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-10 1:50 ` [PATCH 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-03-10 17:28 ` [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-03-12 20:47 ` Taylor Blau [this message]
2025-03-20 20:18 ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2025-03-24 15:22 ` [PATCH v2 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-05-02 22:48 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-02 23:21 ` Taylor Blau
2025-05-06 19:39 ` Derrick Stolee
2025-05-16 15:27 ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-24 15:22 ` [PATCH v2 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-05-02 23:25 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-05-02 23:31 ` Taylor Blau
2025-05-06 19:43 ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-05-02 23:34 ` Taylor Blau
2025-05-16 15:32 ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-02 23:38 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-05-02 23:42 ` Taylor Blau
2025-05-06 19:46 ` Derrick Stolee
2025-05-16 15:41 ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-05-07 0:58 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-05-07 1:14 ` Taylor Blau
2025-05-16 16:27 ` Derrick Stolee
2025-05-29 0:17 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-05-07 1:33 ` Taylor Blau
2025-03-24 15:22 ` [PATCH v2 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-24 15:22 ` [PATCH v2 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-05-02 21:24 ` [PATCH v2 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-05-02 22:45 ` Taylor Blau
2025-05-02 23:44 ` Taylor Blau
2025-05-07 1:35 ` Taylor Blau
2025-05-16 18:11 ` [PATCH v3 " Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-05-16 18:11 ` [PATCH v3 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-05-16 18:12 ` [PATCH v3 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-05-16 18:12 ` [PATCH v3 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-05-16 18:12 ` [PATCH v3 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-05-16 18:12 ` [PATCH v3 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-05-29 0:20 ` [PATCH v3 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z9Hy6Yk2XM1RCsNC@nand.local \
--to=me@ttaylorr.com \
--cc=christian.couder@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=johannes.schindelin@gmx.de \
--cc=johncai86@gmail.com \
--cc=jonathantanmy@google.com \
--cc=karthik.188@gmail.com \
--cc=kristofferhaugsbakk@fastmail.com \
--cc=newren@gmail.com \
--cc=peff@peff.net \
--cc=ps@pks.im \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).