git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Taylor Blau <me@ttaylorr.com>,
	Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, gitster@pobox.com,
	johannes.schindelin@gmx.de, peff@peff.net, ps@pks.im,
	johncai86@gmail.com, newren@gmail.com,
	christian.couder@gmail.com, kristofferhaugsbakk@fastmail.com,
	jonathantanmy@google.com
Subject: Re: [PATCH 0/6] PATH WALK I: The path-walk API
Date: Mon, 4 Nov 2024 10:48:49 -0500	[thread overview]
Message-ID: <2d2940ef-0b26-4060-90b6-9b6969f23754@gmail.com> (raw)
In-Reply-To: <ZyUqr/wb5K4Og9j9@nand.local>

On 11/1/24 3:23 PM, Taylor Blau wrote:
> Hi Stolee,
> 
> On Thu, Oct 31, 2024 at 06:26:57AM +0000, Derrick Stolee via GitGitGadget wrote:
>>
>> Introduction and relation to prior series
>> =========================================
>>
>> This is a new series that rerolls the initial "path-walk API" patches of my
>> RFC [1] "Path-walk API and applications". This new API (in path-walk.c and
>> path-walk.h) presents a new way to walk objects such that trees and blobs
>> are walked in batches according to their path.
>>
>> This also replaces the previous version of ds/path-walk that was being
>> reviewed in [2]. The consensus was that the series was too long/dense and
>> could use some reduction in size. This series takes the first few patches,
>> but also makes some updates (which will be described later).
>>
>> [1]
>> https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
>> [RFC] Path-walk API and applications
>>
>> [2]
>> https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/
>> [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
> 
> I apologize for not having a better place to start discussing a topic
> which pertains to more than just this immediate patch series, but I
> figure here is as good a place as any to do so.
> 
>  From our earlier discussion, it seems to stand that the path-walk API
> is fundamentally incompatible with reachability bitmaps and
> delta-islands, making the series a non-starter in environments that
> rely significantly one or both of those features. My understanding as a
> result is that the path-walk API and feature are more targeted towards
> improving client-side repacks and push performance, where neither of the
> aforementioned two features are used quite as commonly.

This is correct. I would go even farther to say that this approach was
designed first and foremost for Git clients and specifically their
performance while computing a thin packfile during "git push". The same
logic to help the push case happens to also help the "git repack" case
significantly.

> I was discussing this a bit off-list with Peff (who I hope will join the
> thread and share his own thoughts), but I wonder if it was a mistake to
> discard your '--full-name-hash' idea (or something similar, which I'll
> discuss in a bit below) from earlier.

I'd be happy to resurrect that series, adding in the learnings from
working on the path-walk feature. It helps that the current series adds
the path-walk API and has no conflicting changes in the pack-objects or
repack builtins. (I can handle those conflicts as things merge down.)

> (Repeating a few things that I am sure are obvious to you out loud so
> that I can get a grasp on them for my own understanding):
> 
> It seems that the problems you've identified which result in poor repack
> performance occur when you have files at the same path, but they get
> poorly sorted in the delta selection window due to other paths having
> the same final 16 characters, so Git doesn't see that much better delta
> opportunities exist.
> 
> Your series takes into account the full name when hashing, which seems
> to produce a clear win in many cases. I'm sure that there are some cases
> where it presents a modest regression in pack sizes, but I think that's
> fine and probably par for the course when making any changes like this,
> as there is probably no easy silver bullet here that uniformly improves
> all cases.
> 
> I suspect that you could go even further and intern the full path at
> which each object occurs, and sort lexically by that. Just stringing
> together all of the paths in linux.git only takes 3.099 MiB on my clone.
> (Of course, that's unbounded in the number of objects and length of
> their pathnames, but you could at least bound the latter by taking only
> the last, say, 128 characters, which would be more than good enough for
> the kernel, whose longest path is only 102 characters).

When the optimization idea is to focus on the full path and not care
about the "locality" of the path name by its later bits, storing the
full name in a list and storing an index into that list would have a
very similar effect.

I'd be interested to explore the idea of storing the full path name.
Based on my exploration with the 'test-tool name-hash' test helper in
that series, I'm not sure that we will make significant gains by doing
so. Worth trying.

> Some of the repositories that you've tested on I don't have easy access
> to, so I wonder if either doing (a) that, or (b) using some fancier
> context-sensitive hash (like SimHash or MinHash) would be beneficial.

I don't know too much about SimHash or MinHash, but based on what I
could gather from some initial reading I'm not sure that they would be
effective without increasing the hash length. We'd also get a different
kind of locality, such as the appearance of a common word would be more
likely to affect the locality than the end of the path.

The size of the hash could probably be mitigated by storing it in the
list of all full paths and accessing them from the index stored on each
to-pack object.

> I realize that this is taking us back to an idea you've already
> presented to the list, but I think (to me, at least) the benefit and
> simplicity of that approach has only become clear to me in hindsight
> when seeing some alternatives. I would like to apologize for the time
> you spent reworking this series back and forth to have the response be
> "maybe we should have just done the first thing you suggested". Like I
> said, I think to me it was really only clear in hindsight.

I always assumed that we'd come back to it eventually. There is also the
extra bit about making the change to the name-hash compatible with the
way name-hashes are stored in the reachability bitmaps. That will need
some work before it is ready for prime time.

> In any event, the major benefit to doing --full-name-hash would be that
> *all* environments could benefit from the size reduction, not just those
> that don't rely on certain other features.

I disagree that all environments will prefer the --full-name-hash. I'm
currently repeating the performance tests right now, and I've added one.
The issues are:

  1. The --full-name-hash approach sometimes leads to a larger pack when
     using "git push" on the client, especially when the name-hash is
     already effective for compressing across paths.

  2. A depth 1 shallow clone cannot use previous versions of a path, so
     those situations will want to use the normal name hash. This can be
     accomplished simply by disabling the --full-name-hash option when
     the --shallow option is present; a more detailed version could be
     used to check for a large depth before disabling it. This case also
     disables bitmaps, so that isn't something to worry about.

> Perhaps just --full-name-hash isn't quite as good by itself as the
> --path-walk implementation that this series starts us off implementing.
> So in that sense, maybe we want both, which I understand was the
> original approach. I see a couple of options here:
> 
>    - We take both, because doing --path-walk on top represents a
>      significant enough improvement that we are collectively OK with
>      taking on more code to improve a more narrow (but common) use-case.

Doing both doesn't help at all, since the --path-walk approach already
batches by the full path name. The --path-walk approach has a significant
benefit by doing a second pass by the standard name-hash to pick up on the
cross-path deltas. This is why the --path-walk approach with the standard
name hash as consistently provided the most-compact pack-files in all
tests.

   Aside: there were some initial tests that showed the --path-walk option
   led to slightly larger packfiles, but I've since discovered that those
   cases were due to an incorrect walking of indexed paths. This is fixed
   by the code in patch 5 of the current series and my WIP patches in [3]
   have the performance numbers with this change.

[3] https://github.com/gitgitgadget/git/pull/1819
PATH WALK II: Add --path-walk option to 'git pack-objects'

>    - Or we decide that either the benefit isn't significant enough to
>      warrant an additional and relatively complex implementation, or in
>      other words that --full-name-hash by itself is good enough.

I hope that I've sufficiently communicated that --full-name-hash is not
good enough by itself.

The point I was trying to make by submitting it first was that I believed
it was likely the easiest way for Git servers to gain 90% of the benefits
that the --path-walk approach provides while making it relatively easy to
integrate with other server-side features such as bitmaps and delta islands.

(Maybe the --path-walk approach could also be extended to be compatible
with those features, but it would be a significant investment that rebuilds
those features within the context of the new object walk instead of relying
on the existing implementations. That could easily be a blocker.)

> Again, I apologize for not having a clearer picture of this all to start
> with, and I want to tell you specifically and sincerely that I
> appreciate your patience as I wrap my head around all of this. I think
> the benefit of --full-name-hash is much clearer and appealing to me now
> having had both more time and seeing the series approached in a couple
> of different ways. Let me know what you think.
Thanks for taking the time to engage with the patches. I'm currently
rerunning my performance tests on a rebased copy of the --full-name-hash
patches and will submit a new version when it's ready.

Thanks,
-Stolee


  reply	other threads:[~2024-11-04 15:48 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-31  6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
2024-10-31  6:26 ` [PATCH 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-11-01 13:12   ` karthik nayak
2024-11-01 13:44     ` Derrick Stolee
     [not found]   ` <draft-87r07v14kl.fsf@archlinux.mail-host-address-is-not-set>
2024-11-01 13:42     ` karthik nayak
2024-10-31  6:26 ` [PATCH 2/6] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-10-31  6:27 ` [PATCH 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-11-01 13:46   ` karthik nayak
2024-11-01 22:23   ` Jonathan Tan
2024-11-04 15:56     ` Derrick Stolee
2024-11-04 23:39       ` Jonathan Tan
2024-11-08 14:53         ` Derrick Stolee
2024-11-06 14:04   ` Patrick Steinhardt
2024-11-08 14:58     ` Derrick Stolee
2024-10-31  6:27 ` [PATCH 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-10-31  6:27 ` [PATCH 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-11-01 14:25   ` karthik nayak
2024-11-04 15:56     ` Derrick Stolee
2024-10-31  6:27 ` [PATCH 6/6] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-10-31 12:36 ` [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee
2024-11-01 19:23 ` Taylor Blau
2024-11-04 15:48   ` Derrick Stolee [this message]
2024-11-04 17:25     ` Jeff King
2024-11-05  0:11       ` Junio C Hamano
2024-11-08 15:17         ` Derrick Stolee
2024-11-11  2:56           ` Junio C Hamano
2024-11-11 13:20             ` Derrick Stolee
2024-11-11 21:55             ` Jeff King
2024-11-11 22:29               ` Junio C Hamano
2024-11-11 22:04           ` Jeff King
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2024-11-09 19:41   ` [PATCH v2 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-11-09 19:41   ` [PATCH v2 2/6] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-11-09 19:41   ` [PATCH v2 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-11-21 22:39     ` Taylor Blau
2024-11-09 19:41   ` [PATCH v2 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-11-21 22:44     ` Taylor Blau
2024-11-09 19:41   ` [PATCH v2 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-11-09 19:41   ` [PATCH v2 6/6] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-11-21 22:57   ` [PATCH v2 0/6] PATH WALK I: The path-walk API Taylor Blau
2024-11-25  8:56     ` Patrick Steinhardt
2024-11-26  7:39       ` Junio C Hamano
2024-11-26  7:43         ` Patrick Steinhardt
2024-11-26  8:16           ` Junio C Hamano
2024-12-06 19:45   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
2024-12-06 19:45     ` [PATCH v3 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-12-13 11:58       ` Patrick Steinhardt
2024-12-18 14:21         ` Derrick Stolee
2024-12-27 14:18           ` Patrick Steinhardt
2024-12-06 19:45     ` [PATCH v3 2/7] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-12-06 19:45     ` [PATCH v3 3/7] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-12-06 19:45     ` [PATCH v3 4/7] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-12-06 19:45     ` [PATCH v3 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-12-13 11:58       ` Patrick Steinhardt
2024-12-18 14:23         ` Derrick Stolee
2024-12-06 19:45     ` [PATCH v3 6/7] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-12-06 19:45     ` [PATCH v3 7/7] path-walk: reorder object visits Derrick Stolee via GitGitGadget
2024-12-13 11:58     ` [PATCH v3 0/7] PATH WALK I: The path-walk API Patrick Steinhardt
2024-12-20 16:21     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-12-27 14:18         ` Patrick Steinhardt
2024-12-20 16:21       ` [PATCH v4 2/7] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 3/7] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 4/7] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 6/7] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 7/7] path-walk: reorder object visits Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2d2940ef-0b26-4060-90b6-9b6969f23754@gmail.com \
    --to=stolee@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=johncai86@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=kristofferhaugsbakk@fastmail.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).