From: Taylor Blau <me@ttaylorr.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, gitster@pobox.com,
johannes.schindelin@gmx.de, peff@peff.net, ps@pks.im,
johncai86@gmail.com, newren@gmail.com,
christian.couder@gmail.com, kristofferhaugsbakk@fastmail.com,
jonathantanmy@google.com, Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH 0/6] PATH WALK I: The path-walk API
Date: Fri, 1 Nov 2024 15:23:27 -0400 [thread overview]
Message-ID: <ZyUqr/wb5K4Og9j9@nand.local> (raw)
In-Reply-To: <pull.1818.git.1730356023.gitgitgadget@gmail.com>
Hi Stolee,
On Thu, Oct 31, 2024 at 06:26:57AM +0000, Derrick Stolee via GitGitGadget wrote:
>
> Introduction and relation to prior series
> =========================================
>
> This is a new series that rerolls the initial "path-walk API" patches of my
> RFC [1] "Path-walk API and applications". This new API (in path-walk.c and
> path-walk.h) presents a new way to walk objects such that trees and blobs
> are walked in batches according to their path.
>
> This also replaces the previous version of ds/path-walk that was being
> reviewed in [2]. The consensus was that the series was too long/dense and
> could use some reduction in size. This series takes the first few patches,
> but also makes some updates (which will be described later).
>
> [1]
> https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
> [RFC] Path-walk API and applications
>
> [2]
> https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/
> [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
I apologize for not having a better place to start discussing a topic
which pertains to more than just this immediate patch series, but I
figure here is as good a place as any to do so.
From our earlier discussion, it seems to stand that the path-walk API
is fundamentally incompatible with reachability bitmaps and
delta-islands, making the series a non-starter in environments that
rely significantly one or both of those features. My understanding as a
result is that the path-walk API and feature are more targeted towards
improving client-side repacks and push performance, where neither of the
aforementioned two features are used quite as commonly.
I was discussing this a bit off-list with Peff (who I hope will join the
thread and share his own thoughts), but I wonder if it was a mistake to
discard your '--full-name-hash' idea (or something similar, which I'll
discuss in a bit below) from earlier.
(Repeating a few things that I am sure are obvious to you out loud so
that I can get a grasp on them for my own understanding):
It seems that the problems you've identified which result in poor repack
performance occur when you have files at the same path, but they get
poorly sorted in the delta selection window due to other paths having
the same final 16 characters, so Git doesn't see that much better delta
opportunities exist.
Your series takes into account the full name when hashing, which seems
to produce a clear win in many cases. I'm sure that there are some cases
where it presents a modest regression in pack sizes, but I think that's
fine and probably par for the course when making any changes like this,
as there is probably no easy silver bullet here that uniformly improves
all cases.
I suspect that you could go even further and intern the full path at
which each object occurs, and sort lexically by that. Just stringing
together all of the paths in linux.git only takes 3.099 MiB on my clone.
(Of course, that's unbounded in the number of objects and length of
their pathnames, but you could at least bound the latter by taking only
the last, say, 128 characters, which would be more than good enough for
the kernel, whose longest path is only 102 characters).
Some of the repositories that you've tested on I don't have easy access
to, so I wonder if either doing (a) that, or (b) using some fancier
context-sensitive hash (like SimHash or MinHash) would be beneficial.
I realize that this is taking us back to an idea you've already
presented to the list, but I think (to me, at least) the benefit and
simplicity of that approach has only become clear to me in hindsight
when seeing some alternatives. I would like to apologize for the time
you spent reworking this series back and forth to have the response be
"maybe we should have just done the first thing you suggested". Like I
said, I think to me it was really only clear in hindsight.
In any event, the major benefit to doing --full-name-hash would be that
*all* environments could benefit from the size reduction, not just those
that don't rely on certain other features.
Perhaps just --full-name-hash isn't quite as good by itself as the
--path-walk implementation that this series starts us off implementing.
So in that sense, maybe we want both, which I understand was the
original approach. I see a couple of options here:
- We take both, because doing --path-walk on top represents a
significant enough improvement that we are collectively OK with
taking on more code to improve a more narrow (but common) use-case.
- Or we decide that either the benefit isn't significant enough to
warrant an additional and relatively complex implementation, or in
other words that --full-name-hash by itself is good enough.
Again, I apologize for not having a clearer picture of this all to start
with, and I want to tell you specifically and sincerely that I
appreciate your patience as I wrap my head around all of this. I think
the benefit of --full-name-hash is much clearer and appealing to me now
having had both more time and seeing the series approached in a couple
of different ways. Let me know what you think.
Thanks,
Taylor
next prev parent reply other threads:[~2024-11-01 19:23 UTC|newest]
Thread overview: 67+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-31 6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
2024-10-31 6:26 ` [PATCH 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-11-01 13:12 ` karthik nayak
2024-11-01 13:44 ` Derrick Stolee
[not found] ` <draft-87r07v14kl.fsf@archlinux.mail-host-address-is-not-set>
2024-11-01 13:42 ` karthik nayak
2024-10-31 6:26 ` [PATCH 2/6] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-10-31 6:27 ` [PATCH 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-11-01 13:46 ` karthik nayak
2024-11-01 22:23 ` Jonathan Tan
2024-11-04 15:56 ` Derrick Stolee
2024-11-04 23:39 ` Jonathan Tan
2024-11-08 14:53 ` Derrick Stolee
2024-11-06 14:04 ` Patrick Steinhardt
2024-11-08 14:58 ` Derrick Stolee
2024-10-31 6:27 ` [PATCH 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-10-31 6:27 ` [PATCH 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-11-01 14:25 ` karthik nayak
2024-11-04 15:56 ` Derrick Stolee
2024-10-31 6:27 ` [PATCH 6/6] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-10-31 12:36 ` [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee
2024-11-01 19:23 ` Taylor Blau [this message]
2024-11-04 15:48 ` Derrick Stolee
2024-11-04 17:25 ` Jeff King
2024-11-05 0:11 ` Junio C Hamano
2024-11-08 15:17 ` Derrick Stolee
2024-11-11 2:56 ` Junio C Hamano
2024-11-11 13:20 ` Derrick Stolee
2024-11-11 21:55 ` Jeff King
2024-11-11 22:29 ` Junio C Hamano
2024-11-11 22:04 ` Jeff King
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 2/6] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-11-21 22:39 ` Taylor Blau
2024-11-09 19:41 ` [PATCH v2 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-11-21 22:44 ` Taylor Blau
2024-11-09 19:41 ` [PATCH v2 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 6/6] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-11-21 22:57 ` [PATCH v2 0/6] PATH WALK I: The path-walk API Taylor Blau
2024-11-25 8:56 ` Patrick Steinhardt
2024-11-26 7:39 ` Junio C Hamano
2024-11-26 7:43 ` Patrick Steinhardt
2024-11-26 8:16 ` Junio C Hamano
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-12-13 11:58 ` Patrick Steinhardt
2024-12-18 14:21 ` Derrick Stolee
2024-12-27 14:18 ` Patrick Steinhardt
2024-12-06 19:45 ` [PATCH v3 2/7] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 3/7] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 4/7] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-12-13 11:58 ` Patrick Steinhardt
2024-12-18 14:23 ` Derrick Stolee
2024-12-06 19:45 ` [PATCH v3 6/7] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 7/7] path-walk: reorder object visits Derrick Stolee via GitGitGadget
2024-12-13 11:58 ` [PATCH v3 0/7] PATH WALK I: The path-walk API Patrick Steinhardt
2024-12-20 16:21 ` [PATCH v4 " Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-12-27 14:18 ` Patrick Steinhardt
2024-12-20 16:21 ` [PATCH v4 2/7] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 3/7] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 4/7] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 6/7] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 7/7] path-walk: reorder object visits Derrick Stolee via GitGitGadget
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZyUqr/wb5K4Og9j9@nand.local \
--to=me@ttaylorr.com \
--cc=christian.couder@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=johannes.schindelin@gmx.de \
--cc=johncai86@gmail.com \
--cc=jonathantanmy@google.com \
--cc=kristofferhaugsbakk@fastmail.com \
--cc=newren@gmail.com \
--cc=peff@peff.net \
--cc=ps@pks.im \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).