From: Derrick Stolee <stolee@gmail.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: Junio C Hamano <gitster@pobox.com>,
Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
git@vger.kernel.org, johannes.schindelin@gmx.de, peff@peff.net,
ps@pks.im, johncai86@gmail.com, newren@gmail.com
Subject: Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
Date: Tue, 10 Sep 2024 17:05:09 -0400 [thread overview]
Message-ID: <81139d84-b428-4019-bbd1-cda3206d2d68@gmail.com> (raw)
In-Reply-To: <ZuBeHh7uCfhq2XBB@nand.local>
On 9/10/24 10:56 AM, Taylor Blau wrote:
> On Mon, Sep 09, 2024 at 10:37:30PM -0400, Derrick Stolee wrote:
>>> I do agree that considering files at the same path from different
>>> (but close-by) revisions as the prime candidates is very important,
>>> but if you spread the "collissions" very thin by using "uniform
>>> distribution", I am afraid that you'd end up discarding anything but
>>> the blobs at the same path, which may go too far. Having name hash
>>> value that are close by no longer has any meaning in such a system.
>>
>> You are right that some "nearby" paths are lost in this change, and
>> this can be measured by trying to use this option in the pack-objects
>> process underneath a small 'git push'.
>>
>> The thing that surprised me is just how effective this is for the
>> creation of large pack-files that include many versions of most
>> files. The cross-path deltas have less of an effect here, and the
>> benefits of avoiding name-hash collisions can be overwhelming in
>> many cases.
>
> I think that Junio's suggestion is pretty interesting (though please
> take my comments here with a grain of salt, since I haven't read the
> other series yet, and am not sure how much of this is redundant).
>
> Imagine computing both the full and existing name-hash values for each
> blob/tree in the pack. Then objects would be sorted in the delta
> selection window by similar full-name hash and similar regular name hash
> values.
>
> I'm not sure which value you'd actually record in the pack, though.
> Ideally there is a hash function which captures some information about
> the full path as well as the final path component, so we could use a
> single value here, though I suspect the implementation would be more
> complicated than what is presented here.
Is the name hash stored in the pack itself? I know that it is stored
in the 'struct object_entry' data in the packing data. While we could
add another uint32_t into that struct to store both hash values, this
would increase the memory requirements of repacking by four bytes per
object. The struct seemed to be very clear about trying as hard as
possible to avoid doing that.
But maybe an alternative could be replacing that 32-bit number with
an index into an array of paths that have their hash values stored
there.
This would still involve two passes, but might still be possible. I'll
think on this.
>>> I hope you can find a solution that strikes a good balance at the
>>> end of the series (I saw only the first step), but I suspect an easy
>>> way to avoid the downsides you observed is to use both. Compare
>>> with a handful of blobs taken from nearby commits (the original
>>> object order is roughly in traversal order, and you can take
>>> advantage of that fact) from exactly the same path (using your
>>> "uniform distribution") before comparing with the blobs with close
>>> value (of the current function) like the current implementation
>>> does, may go a long way.
>>
>> Funny you should say that, since the RFC I finally submitted [1]
>> actually does just that. The --path-walk option changes the object
>> walk to consider batches of objects based on their path, computes
>> deltas among that batch, and then does the normal name-hash pass
>> later. This seems to really strike the balance that you are
>> looking for and solves the issues where small pushes need to stay
>> small. It also fixes some problematic cases even when pushing a
>> single commit.
>
> Interesting.
>
>> [1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
>
>> However, the --path-walk option requires significant implementation
>> of a "path walk API" and my RFC doesn't even do threading right.
>> The --path-walk version (probably) doesn't work with delta islands
>> or other features the same way as the drop-in change to the name-
>> hash heuristic can. For that reason, I think there is likely some
>> long-term value to the --full-name-hash option even though the
>> --path-walk option will be better in many cases.
>
> I suspect that this is going to be a significant sticking point. Not
> supporting multi-threading is work-able for GitHub (since we set
> pack.threads=1 today), but lacking support for delta-islands makes this
> a non-starter to run at GitHub.
>
> Do you imagine that the --path-walk option could be made to work with
> delta islands? I'm not all that worried about who does that work, but
> more interested at this point in whether or not it's even possible to
> implement.
This is part of the reason why I think the --full-name-hash option is
an interesting consideration. It doesn't have any obvious reason why
it couldn't work with features like delta islands, so it may provide
some quick wins in "large enough" repositories, or at least "large in
the right way".
I unfortunately don't know enough about how the delta islands feature
works to be confident in the possibility of integrating it with the
--path-walk option. At minimum, it would require two object walks:
the first would mark the objects and the second would do the delta
compression with those markings in mind.
But if there is a way to combine both approaches with a two-pass
delta compression technique, then this may be all avoided. I'll give
it a try.
Thanks,
-Stolee
next prev parent reply other threads:[~2024-09-10 21:05 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` [PATCH 1/4] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-09 17:45 ` Junio C Hamano
2024-09-10 2:31 ` Derrick Stolee
2024-09-09 13:56 ` [PATCH 2/4] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-09 17:48 ` Junio C Hamano
2024-09-09 13:56 ` [PATCH 3/4] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` [PATCH 4/4] p5314: add a size test for name-hash collisions Derrick Stolee via GitGitGadget
2024-09-09 16:07 ` [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee
2024-09-09 18:06 ` Junio C Hamano
2024-09-10 2:37 ` Derrick Stolee
2024-09-10 14:56 ` Taylor Blau
2024-09-10 21:05 ` Derrick Stolee [this message]
2024-09-11 6:32 ` Jeff King
2024-09-10 16:07 ` Junio C Hamano
2024-09-10 20:36 ` Junio C Hamano
2024-09-10 21:09 ` Derrick Stolee
2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
2024-09-18 20:46 ` [PATCH v2 1/6] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-19 21:56 ` Junio C Hamano
2024-09-18 20:46 ` [PATCH v2 2/6] repack: test " Derrick Stolee via GitGitGadget
2024-09-19 22:11 ` Junio C Hamano
2024-09-18 20:46 ` [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
2024-09-19 22:22 ` Junio C Hamano
2024-09-23 1:39 ` Derrick Stolee
2024-09-18 20:46 ` [PATCH v2 4/6] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-18 20:46 ` [PATCH v2 5/6] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-19 21:58 ` Junio C Hamano
2024-09-23 1:50 ` Derrick Stolee
2024-09-23 16:14 ` Junio C Hamano
2024-09-18 20:46 ` [PATCH v2 6/6] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-09-24 7:02 ` Patrick Steinhardt
2024-09-25 13:35 ` Derrick Stolee
2024-09-18 23:30 ` [PATCH v2 0/6] pack-objects: create new name-hash algorithm Derrick Stolee
2024-10-04 21:17 ` Junio C Hamano
2024-10-04 22:30 ` Taylor Blau
2024-10-04 22:35 ` Jeff King
2024-10-08 18:32 ` Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=81139d84-b428-4019-bbd1-cda3206d2d68@gmail.com \
--to=stolee@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=johannes.schindelin@gmx.de \
--cc=johncai86@gmail.com \
--cc=me@ttaylorr.com \
--cc=newren@gmail.com \
--cc=peff@peff.net \
--cc=ps@pks.im \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).