git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, johannes.schindelin@gmx.de, peff@peff.net,
	ps@pks.im, me@ttaylorr.com, johncai86@gmail.com,
	newren@gmail.com
Subject: Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
Date: Tue, 10 Sep 2024 17:09:22 -0400	[thread overview]
Message-ID: <1fe607db-04eb-4a6b-a523-9e18397f2d16@gmail.com> (raw)
In-Reply-To: <xmqqikv3dz5l.fsf@gitster.g>

On 9/10/24 4:36 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> The thing that surprised me is just how effective this is for the
>>> creation of large pack-files that include many versions of most
>>> files. The cross-path deltas have less of an effect here, and the
>>> benefits of avoiding name-hash collisions can be overwhelming in
>>> many cases.
>>
>> Yes, "make sure we notice a file F moving from directory A to B" is
>> inherently optimized for short span of history, i.e. a smallish push
>> rather than a whole history clone, where the definition of
>> "smallish" is that even if you create optimal delta chains, the
>> length of these delta chains will not exceed the "--depth" option.
>>
>> If the history you are pushing modified A/F twice, renamed it to B/F
>> (with or without modification at the same time), then modified B/F
>> twice more, you'd want to pack the 5-commit segment and having to
>> artificially cut the delta chain that can contain all of these 5
>> blobs into two at the renaming commit is a huge loss.
> 
> Which actually leads me to suspect that we probably do not even have
> to expose the --full-name-hash option to the end users in "git repack".
> 
> If we are doing incremental that would fit within the depth setting,
> it is likely that we would be better off without the full-name-hash
> optimization, and if we are doing "repack -a" for the whole
> repository, especially with "-f", it would make sense to do the
> full-name-hash optimization.

Depending on how much we learn from others testing the --full-name-hash
option, I could see the potential that -a could imply --full-name-hash.
I hesitate to introduce that in the first release with this option,
though.

> If we can tell how large a chunk of history we are packing before we
> actually start calling builtin/pack-objects.c:add_object_entry(), we
> probably should be able to even select between with and without
> full-name-hash automatically, but I do not think we know the object
> count before we finish calling add_object_entry(), so unless we are
> willing to compute and keep both while reading and pick between the
> two after we finish reading the list of objects, or something, it
> will require a major surgery to do so, I am afraid.

It's also possible that we could check the list of paths at HEAD to
see how many collisions the default name-hash gives. In cases like
the Git repository, there are very few collisions and thus we don't
need to use --full-name-hash. Restricting to just HEAD (or the
default ref) is not a complete analysis, but might be a good
heuristic.

Thanks,
-Stolee


  reply	other threads:[~2024-09-10 21:09 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` [PATCH 1/4] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-09 17:45   ` Junio C Hamano
2024-09-10  2:31     ` Derrick Stolee
2024-09-09 13:56 ` [PATCH 2/4] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-09 17:48   ` Junio C Hamano
2024-09-09 13:56 ` [PATCH 3/4] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` [PATCH 4/4] p5314: add a size test for name-hash collisions Derrick Stolee via GitGitGadget
2024-09-09 16:07 ` [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee
2024-09-09 18:06 ` Junio C Hamano
2024-09-10  2:37   ` Derrick Stolee
2024-09-10 14:56     ` Taylor Blau
2024-09-10 21:05       ` Derrick Stolee
2024-09-11  6:32         ` Jeff King
2024-09-10 16:07     ` Junio C Hamano
2024-09-10 20:36       ` Junio C Hamano
2024-09-10 21:09         ` Derrick Stolee [this message]
2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
2024-09-18 20:46   ` [PATCH v2 1/6] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-19 21:56     ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 2/6] repack: test " Derrick Stolee via GitGitGadget
2024-09-19 22:11     ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
2024-09-19 22:22     ` Junio C Hamano
2024-09-23  1:39       ` Derrick Stolee
2024-09-18 20:46   ` [PATCH v2 4/6] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-18 20:46   ` [PATCH v2 5/6] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-19 21:58     ` Junio C Hamano
2024-09-23  1:50       ` Derrick Stolee
2024-09-23 16:14         ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 6/6] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-09-24  7:02     ` Patrick Steinhardt
2024-09-25 13:35       ` Derrick Stolee
2024-09-18 23:30   ` [PATCH v2 0/6] pack-objects: create new name-hash algorithm Derrick Stolee
2024-10-04 21:17   ` Junio C Hamano
2024-10-04 22:30     ` Taylor Blau
2024-10-04 22:35     ` Jeff King
2024-10-08 18:32   ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1fe607db-04eb-4a6b-a523-9e18397f2d16@gmail.com \
    --to=stolee@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=johncai86@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).