Re: [PATCH 0/6] PATH WALK I: The path-walk API

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Derrick Stolee <stolee@gmail.com>
To: Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>
Cc: Taylor Blau <me@ttaylorr.com>,
	Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, johannes.schindelin@gmx.de, ps@pks.im,
	johncai86@gmail.com, newren@gmail.com,
	christian.couder@gmail.com, kristofferhaugsbakk@fastmail.com,
	jonathantanmy@google.com
Subject: Re: [PATCH 0/6] PATH WALK I: The path-walk API
Date: Fri, 8 Nov 2024 10:17:24 -0500	[thread overview]
Message-ID: <f02ee8ac-01e4-42e3-b99a-d9616b9ff1bb@gmail.com> (raw)
In-Reply-To: <xmqq1pzqwnck.fsf@gitster.g>

On 11/4/24 7:11 PM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
>> On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote:
>>> I disagree that all environments will prefer the --full-name-hash. I'm
>>> currently repeating the performance tests right now, and I've added one.
>>> The issues are:
>>>
>>>   1. The --full-name-hash approach sometimes leads to a larger pack when
>>>      using "git push" on the client, especially when the name-hash is
>>>      already effective for compressing across paths.
>>
>> That's interesting. I wonder which cases get worse, and if a larger
>> window size might help. I.e., presumably we are pushing the candidates
>> further away in the sorted delta list.

I think the cases that make things get worse with --full-name-hash are:

   1. The presence of renames, partitioning objects that used to fit into
      the same bucket (in the case of directory renames).

   2. Some standard kinds of files may appear several times across the
      tree but do not change very often and are similar across path.

   3. Common patterns across similar file types, such as similar includes
      in .c and .h files or other kinds of boilerplate in different
      languages.

A larger window size will always expand the range of possible deltas, at
a cost to the time taken to compute the deltas. My first experiment in
these repositories was to increase the window size to 250 (from the default
10). This caused a very slow repack, but the repositories shrunk.

For example, the big Javascript monorepo that repacked to ~100 GB with
default settings would repack to ~30 GB with --window=250. This was an
indicator that delta compression would work if we can find the right pairs
to use for deltas.

The point of the two strategies (--full-name-hash and --path-walk) is
about putting objects close to each other in better ways than the name
hash sort.

>>>   2. A depth 1 shallow clone cannot use previous versions of a path, so
>>>      those situations will want to use the normal name hash. This can be
>>>      accomplished simply by disabling the --full-name-hash option when
>>>      the --shallow option is present; a more detailed version could be
>>>      used to check for a large depth before disabling it. This case also
>>>      disables bitmaps, so that isn't something to worry about.
>>
>> I'm not sure why a larger hash would be worse in a shallow clone. As you
>> note, with only one version of each path the name-similarity heuristic
>> is not likely to buy you much. But I'd have thought that would be true
>> for the existing name hash as well as a longer one. Maybe this is the
>> "over-emphasizing" case.

I'm confused by your wording of "larger hash" because the hash size
is the exact same: 32 bits. It's just that the --full-name-hash option
has fewer collisions by abandoning the hope of locality.

In a depth 1 shallow clone, there are no repeated paths, so any hash
collisions are true collisions instead of good candidates for deltas.
The full name hash is essentially random, so the delta compression
algorithm basically says:

   1. Sort by type.
   2. Within each type, sort the objects randomly.

With that sort, the delta compression scan is less effective than the
standard name hash.

> I too am curious to hear Derrick explain the above points and what
> was learned from the performance tests.  The original hash was
> designed to place files that are renamed across directories closer
> to each other in the list sorted by the name hash, so a/Makefile and
> b/Makefile would likely be treated as delta-base candidates while
> foo/bar.c and bar/foo.c are treated as unrelated things.  A push
> of a handful of commits that rename paths would likely place the
> rename source of older commits and rename destination of newer
> commits into the same delta chain, even with a smaller delta window.

> In such a history, uniformly-distributed-without-regard-to-renames
> hash is likely to make them into two distinct delta chains, leading
> to less optimal delta-base selection.

Yes. This is the downside of the --full-name-hash compared to the
standard name hash. When repacking an entire repository, the effect
of these renames is typically not important in the long run as it's
basically a single break in the delta chain. The downside comes in
when doing a small fetch or push where the rename has more impact.

The --path-walk approach does not suffer from this problem because
it has a second pass that sorts by the name hash and looks for
better deltas than the ones that already exist. Thus, it gets the
best of both worlds.

The performance impact of the two passes of the --path-walk
approach is interesting, as you'd typically expect this to always
be slower. However:

  1. The delta compression within each batch only compares the
     objects within that batch. We do not compare these objects to
     unrelated objects, which can be expensive and wasteful. This
     also means that small batches may even be smaller than the
     delta window, reducing the number of comparisons.

  2. In the second pass, the delta calculation can short-circuit if
     the computed delta would be larger than the current-best delta.
     Thus, the good deltas from the first pass make the second pass
     faster.

> A whole-repository packing, or a large push or fetch, of the same
> history with renamed files are affected a lot less by such negative
> effects of full-name hash.  When generating a pack with more commits
> than the "--window", use of the original hash would mean blobs from
> paths that share similar names (e.g., "Makefile"s everywhere in the
> directory hierarchy) are placed close to each other, but full-name
> hash will likely group the blobs from exactly the same path and
> nothing else together, and the resulting delta-chain for identical
> (and not similar) paths would be sufficiently long.  A long delta
> chain has to be broken into multiple chains _anyway_ due to finite
> "--depth" setting, so placing blobs from each path into its own
> (initial) delta chain, completely ignoring renamed paths, would
> likely to give us long enough (initial) delta chain to be split at
> the depth limit.
> 
> It would lead to a good delta-base selection with smaller window
> size quite efficiently with full-name hash.>
> I think a full-name hash forces a single-commit pack of a wide tree
> to give up on deltified blobs, but with the original hash, at least
> similar and common files (e.g. Makefile and COPYING) would sit close
> together in the delta queue and can be deltified with each other,
> which may be where the inefficiency comes from when full-name hash
> is used.

Yes, this is a good summary of why this works for the data
efficiency in long histories. Your earlier observations are why
the full-name hash has demonstrated issues with smaller time scales.

These numbers are carefully detailed in the performance tests in
the refreshed series [1]. The series also has a way to disable the
full-name hash when serving a shallow clone for this reason.

[1] https://lore.kernel.org/git/pull.1823.git.1730775907.gitgitgadget@gmail.com/

Thanks,
-Stolee

next prev parent reply	other threads:[~2024-11-08 15:17 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-31  6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
2024-10-31  6:26 ` [PATCH 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-11-01 13:12   ` karthik nayak
2024-11-01 13:44     ` Derrick Stolee
     [not found]   ` <draft-87r07v14kl.fsf@archlinux.mail-host-address-is-not-set>
2024-11-01 13:42     ` karthik nayak
2024-10-31  6:26 ` [PATCH 2/6] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-10-31  6:27 ` [PATCH 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-11-01 13:46   ` karthik nayak
2024-11-01 22:23   ` Jonathan Tan
2024-11-04 15:56     ` Derrick Stolee
2024-11-04 23:39       ` Jonathan Tan
2024-11-08 14:53         ` Derrick Stolee
2024-11-06 14:04   ` Patrick Steinhardt
2024-11-08 14:58     ` Derrick Stolee
2024-10-31  6:27 ` [PATCH 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-10-31  6:27 ` [PATCH 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-11-01 14:25   ` karthik nayak
2024-11-04 15:56     ` Derrick Stolee
2024-10-31  6:27 ` [PATCH 6/6] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-10-31 12:36 ` [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee
2024-11-01 19:23 ` Taylor Blau
2024-11-04 15:48   ` Derrick Stolee
2024-11-04 17:25     ` Jeff King
2024-11-05  0:11       ` Junio C Hamano
2024-11-08 15:17         ` Derrick Stolee [this message]
2024-11-11  2:56           ` Junio C Hamano
2024-11-11 13:20             ` Derrick Stolee
2024-11-11 21:55             ` Jeff King
2024-11-11 22:29               ` Junio C Hamano
2024-11-11 22:04           ` Jeff King
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2024-11-09 19:41   ` [PATCH v2 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-11-09 19:41   ` [PATCH v2 2/6] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-11-09 19:41   ` [PATCH v2 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-11-21 22:39     ` Taylor Blau
2024-11-09 19:41   ` [PATCH v2 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-11-21 22:44     ` Taylor Blau
2024-11-09 19:41   ` [PATCH v2 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-11-09 19:41   ` [PATCH v2 6/6] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-11-21 22:57   ` [PATCH v2 0/6] PATH WALK I: The path-walk API Taylor Blau
2024-11-25  8:56     ` Patrick Steinhardt
2024-11-26  7:39       ` Junio C Hamano
2024-11-26  7:43         ` Patrick Steinhardt
2024-11-26  8:16           ` Junio C Hamano
2024-12-06 19:45   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
2024-12-06 19:45     ` [PATCH v3 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-12-13 11:58       ` Patrick Steinhardt
2024-12-18 14:21         ` Derrick Stolee
2024-12-27 14:18           ` Patrick Steinhardt
2024-12-06 19:45     ` [PATCH v3 2/7] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-12-06 19:45     ` [PATCH v3 3/7] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-12-06 19:45     ` [PATCH v3 4/7] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-12-06 19:45     ` [PATCH v3 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-12-13 11:58       ` Patrick Steinhardt
2024-12-18 14:23         ` Derrick Stolee
2024-12-06 19:45     ` [PATCH v3 6/7] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-12-06 19:45     ` [PATCH v3 7/7] path-walk: reorder object visits Derrick Stolee via GitGitGadget
2024-12-13 11:58     ` [PATCH v3 0/7] PATH WALK I: The path-walk API Patrick Steinhardt
2024-12-20 16:21     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-12-27 14:18         ` Patrick Steinhardt
2024-12-20 16:21       ` [PATCH v4 2/7] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 3/7] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 4/7] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 6/7] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-12-20 16:21       ` [PATCH v4 7/7] path-walk: reorder object visits Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f02ee8ac-01e4-42e3-b99a-d9616b9ff1bb@gmail.com \
    --to=stolee@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=johncai86@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=kristofferhaugsbakk@fastmail.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).