git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, christian.couder@gmail.com,
	johannes.schindelin@gmx.de, johncai86@gmail.com,
	jonathantanmy@google.com, karthik.188@gmail.com,
	kristofferhaugsbakk@fastmail.com, newren@gmail.com,
	peff@peff.net, ps@pks.im
Subject: Re: [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects'
Date: Thu, 20 Mar 2025 16:18:32 -0400	[thread overview]
Message-ID: <192714d0-d351-405b-9186-367c5212aeca@gmail.com> (raw)
In-Reply-To: <Z9Hy6Yk2XM1RCsNC@nand.local>

On 3/12/2025 4:47 PM, Taylor Blau wrote:
> On Mon, Mar 10, 2025 at 10:28:22AM -0700, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> In the above three examples we see some trade-offs between pack size and
> the time it took to generate it. I think it's worth discussing whether
> or not the potential benefit of such a trade-off is worth the
> significant complexity and code that this feature will introduce. (To be
> clear, I don't have a strong opinion here one way or the other, but I do
> think that it's at least worth discussing).
> 
> I wonder how much of the benefits of path-walk over the hash v2 approach
> could be had by simply widening the pack.window during delta selection?
> 
> I tried to run a similar experiment as you did above on the
> microsoft/fluentui repository and got the following:
> 
>     Repack Method       Pack Size       Time
>     ------------------------------------------
>     Hash v1              447.2MiB      932.41s
>     Hash v2              154.1MiB      404.35s
>     Hash v2 (window=20)  146.7MiB      472.66s
>     Hash v2 (window=50)  138.3MiB      622.13s
>     Path Walk            140.8MiB      168.86s
> 
> In your experiment above on the same repository, the path walk feature
> represents an 11.873% reduction in pack size, but at the cost of a 30.9%
> regression in runtime.
> 
> When I set pack.window to "50" (over the default value of "10"), I get a
> ~10.3% reduction in pack size at the cost of a 54% increase in runtime
> (relative to just --name-hash-version=2 with the default pack.window
> settings).
> 
> But when I set the pack.window to "20", the relative values (again
> comparing against --name-hash-version=2 with the default pack.window)
> are 4.8% reduction in pack size and a 16.9% increase in runtime.

You're right that I wasn't including data around the --window option in
my analysis. This option presents folks with the opportunity to add CPU
time in order to improve the possibility of better compression due to
considering more object pairs.

But it's also important to note that that option still works with
--path-walk, except that the --path-walk option is focused on improving
the quality of objects being considered within a window. There's also the
aspect that there are two passes (one path-based and one name-hash-based)
so increasing the --window size has a larger impact on the --path-walk
option.

With regards to the microsoft/fluentui repo, I had previously been using
an old clone using --bare. The size changes if I use --mirror as well,
since it will get the fork hint refs corresponding to objects in public
forks that are not actually in the core repo. This changes the clone size
as well as the repacked size.

(To save time, I didn't repeat the --window option tests for name hash
v1 as name hash v2 is clearly superior to that option in this repo.)

Cloned with --bare:

| Type         | Window: 10     | Window: 20     | Window: 50     |
|--------------|----------------|----------------|----------------|
| name hash v1 | 451 M | 1m 42s |       |        |       |        |
| name hash v2 | 160 M | 35.4 s | 151 M | 25.4 s | 141 M | 31.0 s |
| --path-walk  | 141 M | 31.0 s | 136 M | 35.7 s | 129 M | 49.3 s |

Cloned with --mirror:

| Type         | Window: 10     | Window: 20     | Window: 50      |
|--------------|----------------|----------------|-----------------|
| name hash v1 | 882 M | 3m 27s |       |        |       |         |
| name hash v2 | 584 M | 70.4 s | 554 M | 54.6 s | 530 M | 69.4 s  |
| --path-walk  | 548 M | 79.8 s | 523 M | 93.9 s | 507 M | 126.2 s |

Running on a slightly-larger Javascript repo with the same CHANGLOG
filename issue, I get these results:

| Type         | Window: 10     | Window: 20     | Window: 50     |
|--------------|----------------|----------------|----------------|
| name hash v1 | 6.4 G | 36m 9s |       |        |       |        |
| name hash v2 | 920 M | 7m 39s | 767 M | 5m 49s | 665 M | 6m 12s |
| --path-walk  | 834 M | 4m 48s | 697 M | 7m 39s | 615 M | 8m 42s |

> But these numbers are pretty confusing to me, TBH. The reduction in pack
> sizes makes sense, and here I see numbers that are on-par with what you
> noted above for the same repository. But the runtimes are wildly
> different (e.g., hash v1 takes you just 87s while mine takes 932s).

I wonder if it's related to threading? I'm using as many cores as I can.

> There must be something in our environment that is different. I'm
> starting with a bare clone of microsoft/fluentui from GitHub, and made
> several 'cp -al' copies of it for the different experiments. In the
> penultimate one, I ran:
> 
>     $ time git.compile -c pack.window=50 repack --name-hash-version=2 \
>         -adF --no-write-bitmap-index

There's also some strange things with my numbers because I'm not copying
the same data into multiple places but instead running the test on the
same repo. Thus, the "input size" is changing with each run and this is
probably a big factor in the larger tests.

So the order in my tables is left-to-right, top-to-bottom, like reading
a page in English. Thus, the short time for --path-walk --window=10 in
the last example is maybe a bit faster because it is starting from the
665 M from the --name-hash-version=2 --window=50 example. 
> In any event, it seems like at least in this example we can get
> performance that is on-par with path-walk by simply widening the
> pack.window when using hash v2. On my machine that seems to cost more
> time than it does for you to the point where it's slower than my
> path-walk. But I think I need to understand what the differences are
> here before we can draw any conclusions on the size or timing.

I'd be very curious to see if more folks have bandwidth to do similar
testing. My default mode is that I like giving users more options to
explore which may work better for them. 
> If the overwhelming majority of cases where the --path-walk feature
> presents a significant benefit over hash v2 at various pack.window sizes
> (where we could get approximately the same reduction in pack size with
> approximately the same end-to-end runtime of 'git repack'), then I feel
> we might want to reconsider whether or not the complexity of this feature
> is worthwhile.
> 
> But if the --path-walk feature either gives us a significant size
> benefit that we can't get with hash v2 and a wider pack.window without
> paying a significant runtime cost (or vice-versa), then this feature
> would indeed be worthwhile.
> 
> I also have no idea how representative the above is of your intended
> use-case, which seems much more oriented around pushes than from-scratch
> repacks, which would also affect our conclusions here.The push story is valuable, but I'm also interested in helping users shrink their local repositories in whatever means they are
willing to wait for.

---

Meta-response to your patch review: I have made adjustments to my local
branch in response to the points you brought up. I'll hold off on v2
for a few more days to give more opportunity for review.

Thanks,
-Stolee


  reply	other threads:[~2025-03-20 20:18 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-03-12 21:01   ` Taylor Blau
2025-03-20 19:48     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-12 21:14   ` Taylor Blau
2025-03-20 19:46     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-12 21:14   ` Taylor Blau
2025-03-10  1:50 ` [PATCH 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-03-12 21:21   ` Taylor Blau
2025-03-20 19:57     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-03-10 17:28 ` [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-03-12 20:47   ` Taylor Blau
2025-03-20 20:18     ` Derrick Stolee [this message]
2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-05-02 22:48     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-02 23:21     ` Taylor Blau
2025-05-06 19:39       ` Derrick Stolee
2025-05-16 15:27         ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-05-02 23:25     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-05-02 23:31     ` Taylor Blau
2025-05-06 19:43       ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-05-02 23:34     ` Taylor Blau
2025-05-16 15:32       ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-02 23:38     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-05-02 23:42     ` Taylor Blau
2025-05-06 19:46       ` Derrick Stolee
2025-05-16 15:41         ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-05-07  0:58     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-05-07  1:14     ` Taylor Blau
2025-05-16 16:27       ` Derrick Stolee
2025-05-29  0:17         ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-05-07  1:33     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-05-02 21:24   ` [PATCH v2 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-05-02 22:45     ` Taylor Blau
2025-05-02 23:44       ` Taylor Blau
2025-05-07  1:35         ` Taylor Blau
2025-05-16 18:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-05-29  0:20     ` [PATCH v3 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=192714d0-d351-405b-9186-367c5212aeca@gmail.com \
    --to=stolee@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=johncai86@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=karthik.188@gmail.com \
    --cc=kristofferhaugsbakk@fastmail.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).