From: Derrick Stolee <stolee@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Christian Couder <christian.couder@gmail.com>,
Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
git@vger.kernel.org, johannes.schindelin@gmx.de, peff@peff.net,
ps@pks.im, me@ttaylorr.com, johncai86@gmail.com,
newren@gmail.com
Subject: Re: [PATCH 00/30] [RFC] Path-walk API and applications
Date: Sun, 22 Sep 2024 21:22:59 -0400 [thread overview]
Message-ID: <6b672771-4016-49e8-a045-0a48bc8c1522@gmail.com> (raw)
In-Reply-To: <xmqqplov7cw8.fsf@gitster.g>
On 9/22/24 2:37 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> Combining the two features actually ends up with very similar performance
>> to what `--full-name-hash` already does. It's actually important that the
>> `--path-walk` option does a full pass of the objects via the standard
>> name-hash after its first pass in groups based on the path.
>> ...
>> I was not clear about this, but the RFC is 30 patches so it's possible to see
>> the big picture, but I will be breaking it into at least four series in
>> sequence for actual review. They match the four sections described above, but
>> will be in the opposite order:
>>
>> A. `git repack --full-name-hash`
>> B. `git pack-objects --path-walk`
>> C. `git survey`
>> D. `git backfill`
>>
>> (It's possible that `git survey` and `git backfill` may be orthogonal enough
>> that they could be under review at the same time. Alternatively, `git backfill`
>> may jump the line because it's so simple to implement once the path-walk API
>> is established.)
>
> I actually was hoping to hear something like "since it turns out
> that --path-walk gives a better performance and it does not regress
> small incremental transfer like --full-name-hash does, the real
> series drops --full-name hash", i.e. without part (A). That reduces
> things we need to worry about (like having to either keep track of
> two "hashes" per object, or making small incremental transfer more
> costly) greatly.
I believe that the --full-name-hash version still has some benefits, in
that it could better integrate with reachability bitmaps and delta
islands:
1. The .bitmap file format would need a modification in order to signal
which hash function is being used for compatibility reasons, but
this does seem within reach without too much work.
2. The delta islands feature integrates seamlessly with
--full-name-hash and seems difficult to integrate with the
--path-walk feature. Either we would need to have a second object
walk to get the delta island markers, or somehow put the passing of
the object markers into the path-walk API itself (similar to how it
needs to push the UNINTERESTING bit around during the walk).
I'm not recommending any version that requires tracking two hash values
per object, as I have not been able to demonstrate any improvement when
doing so.
But, it would be helpful to know if the --full-name-hash feature should
not be pursued due to the --path-walk feature being prepared shortly
after it. I can see an argument for either direction: having a new hash
algorithm provides a smaller change to get most of the results for the
full repack case, but gets worse performance in many push scenarios.
This is the point of an RFC, to get questions like this worked out based
on the "big picture" view of everything.
Perhaps I should pause the --full-name-hash topic and focus on getting
the --path-walk topic up and running. I am curious to hear from folks
who are currently running Git servers about their thoughts on these
trade-offs and potential uses in their environment. My needs on the
client side are solved by the --path-walk approach.
Thanks,
-Stolee
next prev parent reply other threads:[~2024-09-23 1:23 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-10 2:28 [PATCH 00/30] [RFC] Path-walk API and applications Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 01/30] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 02/30] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 03/30] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 04/30] backfill: add --batch-size=<n> option Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 05/30] backfill: add --sparse option Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 06/30] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 07/30] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 08/30] path-walk: allow visiting tags Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 09/30] survey: stub in new experimental `git-survey` command Jeff Hostetler via GitGitGadget
2024-09-10 2:28 ` [PATCH 10/30] survey: add command line opts to select references Jeff Hostetler via GitGitGadget
2024-09-10 2:28 ` [PATCH 11/30] survey: collect the set of requested refs Jeff Hostetler via GitGitGadget
2024-09-10 2:28 ` [PATCH 12/30] survey: start pretty printing data in table form Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 13/30] survey: add object count summary Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 14/30] survey: summarize total sizes by object type Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 15/30] survey: show progress during object walk Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 16/30] survey: add ability to track prioritized lists Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 17/30] survey: add report of "largest" paths Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 18/30] revision: create mark_trees_uninteresting_dense() Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 19/30] path-walk: add prune_all_uninteresting option Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 20/30] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 21/30] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 22/30] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 23/30] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 24/30] repack: add --path-walk option Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 25/30] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 26/30] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 27/30] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 28/30] test-name-hash: add helper to compute name-hash functions Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 29/30] p5314: add a size test for name-hash collisions Derrick Stolee via GitGitGadget
2024-09-10 2:28 ` [PATCH 30/30] pack-objects: output debug info about deltas Derrick Stolee via GitGitGadget
2024-09-11 21:32 ` [PATCH 00/30] [RFC] Path-walk API and applications Junio C Hamano
2024-09-17 10:41 ` Christian Couder
2024-09-18 23:18 ` Derrick Stolee
2024-09-22 18:37 ` Junio C Hamano
2024-09-23 1:22 ` Derrick Stolee [this message]
2024-09-23 16:56 ` Junio C Hamano
2024-09-22 21:08 ` Kristoffer Haugsbakk
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=6b672771-4016-49e8-a045-0a48bc8c1522@gmail.com \
--to=stolee@gmail.com \
--cc=christian.couder@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=johannes.schindelin@gmx.de \
--cc=johncai86@gmail.com \
--cc=me@ttaylorr.com \
--cc=newren@gmail.com \
--cc=peff@peff.net \
--cc=ps@pks.im \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).