From: Taylor Blau <me@ttaylorr.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, gitster@pobox.com,
johannes.schindelin@gmx.de, peff@peff.net, ps@pks.im,
johncai86@gmail.com, newren@gmail.com, jonathantanmy@google.com,
karthik nayak <karthik.188@gmail.com>,
Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH v3 2/8] pack-objects: add --name-hash-version option
Date: Wed, 22 Jan 2025 17:17:36 -0500 [thread overview]
Message-ID: <Z5FugEXhdKjhwcnP@nand.local> (raw)
In-Reply-To: <d035e3e59f42c75760e8dd6fe8ed6dff12bc8b9a.1734715194.git.gitgitgadget@gmail.com>
On Fri, Dec 20, 2024 at 05:19:48PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
>
> The previous change introduced a new pack_name_hash_v2() function that
> intends to satisfy much of the hash locality features of the existing
> pack_name_hash() function while also distinguishing paths with similar
> final components of their paths.
>
> This change adds a new --name-hash-version option for 'git pack-objects'
> to allow users to select their preferred function version. This use of
> an integer version allows for future expansion and a direct way to later
> store a name hash version in the .bitmap format.
>
> For now, let's consider how effective this mechanism is when repacking a
> repository with different name hash versions. Specifically, we will
> execute 'git pack-objects' the same way a 'git repack -adf' process
> would, except we include --name-hash-version=<n> for testing.
>
> On the Git repository, we do not expect much difference. All path names
> are short. This is backed by our results:
>
> | Stage | Pack Size | Repack Time |
> |-----------------------|-----------|-------------|
> | After clone | 260 MB | N/A |
> | --name-hash-version=1 | 127 MB | 129s |
> | --name-hash-version=2 | 127 MB | 112s |
>
> This example demonstrates how there is some natural overhead coming from
> the cloned copy because the server is hosting many forks and has not
> optimized for exactly this set of reachable objects. But the full repack
> has similar characteristics for both versions.
>
> Let's consider some repositories that are hitting too many collisions
> with version 1. First, let's explore the kinds of paths that are
> commonly causing these collisions:
>
> * "/CHANGELOG.json" is 15 characters, and is created by the beachball
> [1] tool. Only the final character of the parent directory can
> differentiate different versions of this file, but also only the two
> most-significant digits. If that character is a letter, then this is
> always a collision. Similar issues occur with the similar
> "/CHANGELOG.md" path, though there is more opportunity for
> differences In the parent directory.
>
> * Localization files frequently have common filenames but
> differentiates via parent directories. In C#, the name
> "/strings.resx.lcl" is used for these localization files and they
> will all collide in name-hash.
>
> [1] https://github.com/microsoft/beachball
>
> I've come across many other examples where some internal tool uses a
> common name across multiple directories and is causing Git to repack
> poorly due to name-hash collisions.
>
> One open-source example is the fluentui [2] repo, which uses beachball
> to generate CHANGELOG.json and CHANGELOG.md files, and these files have
> very poor delta characteristics when comparing against versions across
> parent directories.
>
> | Stage | Pack Size | Repack Time |
> |-----------------------|-----------|-------------|
> | After clone | 694 MB | N/A |
> | --name-hash-version=1 | 438 MB | 728s |
> | --name-hash-version=2 | 168 MB | 142s |
>
> [2] https://github.com/microsoft/fluentui
>
> In this example, we see significant gains in the compressed packfile
> size as well as the time taken to compute the packfile.
>
> Using a collection of repositories that use the beachball tool, I was
> able to make similar comparisions with dramatic results. While the
> fluentui repo is public, the others are private so cannot be shared for
> reproduction. The results are so significant that I find it important to
> share here:
>
> | Repo | --name-hash-version=1 | --name-hash-version=2 |
> |----------|-----------------------|-----------------------|
> | fluentui | 440 MB | 161 MB |
> | Repo B | 6,248 MB | 856 MB |
> | Repo C | 37,278 MB | 6,755 MB |
> | Repo D | 131,204 MB | 7,463 MB |
>
> Future changes could include making --name-hash-version implied by a config
> value or even implied by default during a full repack.
>
> It is important to point out that the name hash value is stored in the
> .bitmap file format, so we must force --name-hash-version=1 when bitmaps
> are being read or written. Later, the bitmap format could be updated to
> be aware of the name hash version so deltas can be quickly computed
> across the bitmapped/not-bitmapped boundary.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> Documentation/git-pack-objects.txt | 32 ++++++++++++++++++-
> builtin/pack-objects.c | 49 +++++++++++++++++++++++++++---
> t/t5300-pack-object.sh | 31 +++++++++++++++++++
> 3 files changed, 106 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
> index e32404c6aae..7f69ae4855f 100644
> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -15,7 +15,8 @@ SYNOPSIS
> [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
> [--cruft] [--cruft-expiration=<time>]
> [--stdout [--filter=<filter-spec>] | <base-name>]
> - [--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
> + [--shallow] [--keep-true-parents] [--[no-]sparse]
> + [--name-hash-version=<n>] < <object-list>
>
>
> DESCRIPTION
> @@ -345,6 +346,35 @@ raise an error.
> Restrict delta matches based on "islands". See DELTA ISLANDS
> below.
>
> +--name-hash-version=<n>::
> + While performing delta compression, Git groups objects that may be
> + similar based on heuristics using the path to that object. While
> + grouping objects by an exact path match is good for paths with
> + many versions, there are benefits for finding delta pairs across
> + different full paths. Git collects objects by type and then by a
> + "name hash" of the path and then by size, hoping to group objects
> + that will compress well together.
> ++
> +The default name hash version is `1`, which prioritizes hash locality by
> +considering the final bytes of the path as providing the maximum magnitude
> +to the hash function. This version excels at distinguishing short paths
> +and finding renames across directories. However, the hash function depends
> +primarily on the final 16 bytes of the path. If there are many paths in
> +the repo that have the same final 16 bytes and differ only by parent
> +directory, then this name-hash may lead to too many collisions and cause
> +poor results. At the moment, this version is required when writing
> +reachability bitmap files with `--write-bitmap-index`.
> ++
> +The name hash version `2` has similar locality features as version `1`,
> +except it considers each path component separately and overlays the hashes
> +with a shift. This still prioritizes the final bytes of the path, but also
> +"salts" the lower bits of the hash using the parent directory names. This
> +method allows for some of the locality benefits of version `1` while
> +breaking most of the collisions from a similarly-named file appearing in
> +many different directories. At the moment, this version is not allowed
> +when writing reachability bitmap files with `--write-bitmap-index` and it
> +will be automatically changed to version `1`.
> +
>
> DELTA ISLANDS
> -------------
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 08007142671..90ea19417bc 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -266,6 +266,28 @@ struct configured_exclusion {
> static struct oidmap configured_exclusions;
>
> static struct oidset excluded_by_config;
> +static int name_hash_version = 1;
> +
> +static void validate_name_hash_version(void)
> +{
> + if (name_hash_version < 1 || name_hash_version > 2)
> + die(_("invalid --name-hash-version option: %d"), name_hash_version);
> +}
> +
> +static inline uint32_t pack_name_hash_fn(const char *name)
> +{
> + switch (name_hash_version)
> + {
This is definitely a nitpick, but the opening curly brace should appear
on the preceding line. I don't think our CodingGuidelines explicitly say
this. But in my head the convention is that only function bodies have
their enclosing braces on their own line, as opposed to:
static inline uint32_t pack_name_hash_fn(const char *name) {
> + case 1:
> + return pack_name_hash(name);
> +
> + case 2:
> + return pack_name_hash_v2((const unsigned char *)name);
> +
> + default:
> + BUG("invalid name-hash version: %d", name_hash_version);
> + }
> +}
Otherwise this function looks very reasonable.
> @@ -4576,6 +4609,12 @@ int cmd_pack_objects(int argc,
> if (pack_to_stdout || !rev_list_all)
> write_bitmap_index = 0;
>
> + validate_name_hash_version();
> + if (write_bitmap_index && name_hash_version != 1) {
> + warning(_("currently, --write-bitmap-index requires --name-hash-version=1"));
> + name_hash_version = 1;
> + }
> +
Hmm. validate_name_hash_version() is its own function, which seems good,
but then we do further validation on it here. I wonder if we should
either move the latter step into validate_name_hash_version(), which
would require us to pass a pointer to name_hash_version, or assign it
the return value of validate_name_hash_version() (assuming it were
changed to return the appropriate type instead of void).
I think that we are probably pretty far into bike-shedding territory,
but figured I'd share as it jumped out to me while reviewing.
> if (use_delta_islands)
> strvec_push(&rp, "--topo-order");
>
> diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
> index 3b9dae331a5..4270eabe8b7 100755
> --- a/t/t5300-pack-object.sh
> +++ b/t/t5300-pack-object.sh
> @@ -674,4 +674,35 @@ do
> '
> done
>
> +test_expect_success 'valid and invalid --name-hash-versions' '
> + # Valid values are hard to verify other than "do not fail".
> + # Performance tests will be more valuable to validate these versions.
> + for value in 1 2
> + do
> + git pack-objects base --all --name-hash-version=$value || return 1
> + done &&
> +
> + # Invalid values have clear post-conditions.
> + for value in -1 0 3
> + do
> + test_must_fail git pack-objects base --all --name-hash-version=$value 2>err &&
> + test_grep "invalid --name-hash-version option" err || return 1
> + done
> +'
Looks great, and thanks for handling a few nonsensical values in the
latter loop.
> +# The following test is not necessarily a permanent choice, but since we do not
> +# have a "name hash version" bit in the .bitmap file format, we cannot write the
> +# hash values into the .bitmap file without risking breakage later.
> +#
> +# TODO: Make these compatible in the future and replace this test with the
> +# expected behavior when both are specified.
> +test_expect_success '--name-hash-version=2 and --write-bitmap-index are incompatible' '
> + git pack-objects base --all --name-hash-version=2 --write-bitmap-index 2>err &&
> + test_grep "currently, --write-bitmap-index requires --name-hash-version=1" err &&
> +
> + # --stdout option silently removes --write-bitmap-index
> + git pack-objects --stdout --all --name-hash-version=2 --write-bitmap-index >out 2>err &&
> + ! test_grep "currently, --write-bitmap-index requires --name-hash-version=1" err
> +'
This is grea, too, I appreciate having the reminder here for the future
(and it'll feel so satisfying to change/remove this test when the time
comes) ;-).
Thanks,
Taylor
next prev parent reply other threads:[~2025-01-22 22:17 UTC|newest]
Thread overview: 93+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-05 3:05 [PATCH 0/7] pack-objects: Create an alternative name hash algorithm (recreated) Derrick Stolee via GitGitGadget
2024-11-05 3:05 ` [PATCH 1/7] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-11-21 20:08 ` Taylor Blau
2024-11-21 21:35 ` Taylor Blau
2024-11-21 23:32 ` Junio C Hamano
2024-11-22 11:46 ` Derrick Stolee
2024-11-22 11:59 ` Derrick Stolee
2024-11-26 8:26 ` Patrick Steinhardt
2024-11-05 3:05 ` [PATCH 2/7] repack: " Derrick Stolee via GitGitGadget
2024-11-21 20:12 ` Taylor Blau
2024-11-22 12:07 ` Derrick Stolee
2024-11-05 3:05 ` [PATCH 3/7] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
2024-11-21 20:15 ` Taylor Blau
2024-11-22 12:09 ` Derrick Stolee
2024-11-22 1:13 ` Jonathan Tan
2024-11-22 3:23 ` Junio C Hamano
2024-11-22 18:01 ` Jonathan Tan
2024-11-25 0:39 ` Junio C Hamano
2024-11-25 19:45 ` Jonathan Tan
2024-11-26 1:29 ` Junio C Hamano
2024-11-26 8:26 ` Patrick Steinhardt
2024-11-05 3:05 ` [PATCH 4/7] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-11-21 20:17 ` Taylor Blau
2024-11-22 15:26 ` Derrick Stolee
2024-11-05 3:05 ` [PATCH 5/7] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-11-21 20:31 ` Taylor Blau
2024-11-22 15:26 ` Derrick Stolee
2024-11-26 8:26 ` Patrick Steinhardt
2024-11-05 3:05 ` [PATCH 6/7] pack-objects: disable --full-name-hash when shallow Derrick Stolee via GitGitGadget
2024-11-21 20:33 ` Taylor Blau
2024-11-22 15:27 ` Derrick Stolee
2024-11-05 3:05 ` [PATCH 7/7] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-11-21 20:42 ` Taylor Blau
2024-11-22 1:23 ` Jonathan Tan
2024-11-21 23:50 ` [PATCH 0/7] pack-objects: Create an alternative name hash algorithm (recreated) Jonathan Tan
2024-11-22 3:01 ` Junio C Hamano
2024-11-22 4:22 ` Junio C Hamano
2024-11-22 15:27 ` Derrick Stolee
2024-11-24 23:57 ` Junio C Hamano
2024-11-22 18:05 ` Jonathan Tan
2024-12-02 23:21 ` [PATCH v2 0/8] " Derrick Stolee via GitGitGadget
2024-12-02 23:21 ` [PATCH v2 1/8] pack-objects: create new name-hash function version Jonathan Tan via GitGitGadget
2024-12-04 20:06 ` karthik nayak
2024-12-04 21:05 ` Junio C Hamano
2024-12-05 9:46 ` karthik nayak
2024-12-09 23:15 ` Jonathan Tan
2024-12-10 0:01 ` Junio C Hamano
2024-12-02 23:21 ` [PATCH v2 2/8] pack-objects: add --name-hash-version option Derrick Stolee via GitGitGadget
2024-12-04 20:53 ` karthik nayak
2024-12-02 23:21 ` [PATCH v2 3/8] repack: " Derrick Stolee via GitGitGadget
2024-12-04 21:15 ` karthik nayak
2024-12-02 23:21 ` [PATCH v2 4/8] pack-objects: add GIT_TEST_NAME_HASH_VERSION Derrick Stolee via GitGitGadget
2024-12-04 21:21 ` karthik nayak
2024-12-09 23:12 ` Jonathan Tan
2024-12-20 17:03 ` Derrick Stolee
2024-12-02 23:21 ` [PATCH v2 5/8] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-12-02 23:21 ` [PATCH v2 6/8] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-12-02 23:21 ` [PATCH v2 7/8] pack-objects: prevent name hash version change Derrick Stolee via GitGitGadget
2024-12-02 23:21 ` [PATCH v2 8/8] pack-objects: add third name hash version Derrick Stolee via GitGitGadget
2024-12-03 3:23 ` [PATCH v2 0/8] pack-objects: Create an alternative name hash algorithm (recreated) Junio C Hamano
2024-12-04 4:56 ` Derrick Stolee
2024-12-04 5:02 ` Junio C Hamano
2024-12-20 17:19 ` [PATCH v3 " Derrick Stolee via GitGitGadget
2024-12-20 17:19 ` [PATCH v3 1/8] pack-objects: create new name-hash function version Jonathan Tan via GitGitGadget
2025-01-22 22:08 ` Taylor Blau
2024-12-20 17:19 ` [PATCH v3 2/8] pack-objects: add --name-hash-version option Derrick Stolee via GitGitGadget
2025-01-22 22:17 ` Taylor Blau [this message]
2025-01-24 17:29 ` Derrick Stolee
2024-12-20 17:19 ` [PATCH v3 3/8] repack: " Derrick Stolee via GitGitGadget
2025-01-22 22:18 ` Taylor Blau
2024-12-20 17:19 ` [PATCH v3 4/8] pack-objects: add GIT_TEST_NAME_HASH_VERSION Derrick Stolee via GitGitGadget
2025-01-22 22:20 ` Taylor Blau
2024-12-20 17:19 ` [PATCH v3 5/8] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-12-20 17:19 ` [PATCH v3 6/8] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-12-20 17:19 ` [PATCH v3 7/8] pack-objects: prevent name hash version change Derrick Stolee via GitGitGadget
2025-01-22 22:22 ` Taylor Blau
2024-12-20 17:19 ` [PATCH v3 8/8] pack-objects: add third name hash version Derrick Stolee via GitGitGadget
2025-01-22 22:37 ` Taylor Blau
2025-01-24 17:34 ` Derrick Stolee
2025-01-21 20:21 ` [PATCH v3 0/8] pack-objects: Create an alternative name hash algorithm (recreated) Derrick Stolee
2025-01-22 23:28 ` Taylor Blau
2025-01-24 17:45 ` Derrick Stolee
2025-01-27 19:02 ` [PATCH v4 0/7] " Derrick Stolee via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 1/7] pack-objects: create new name-hash function version Jonathan Tan via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 2/7] pack-objects: add --name-hash-version option Derrick Stolee via GitGitGadget
2025-01-27 21:18 ` Junio C Hamano
2025-01-29 13:38 ` Derrick Stolee
2025-01-27 19:02 ` [PATCH v4 3/7] repack: " Derrick Stolee via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 4/7] pack-objects: add GIT_TEST_NAME_HASH_VERSION Derrick Stolee via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 5/7] p5313: add size comparison test Derrick Stolee via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 6/7] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 7/7] pack-objects: prevent name hash version change Derrick Stolee via GitGitGadget
2025-01-31 21:39 ` [PATCH v4 0/7] pack-objects: Create an alternative name hash algorithm (recreated) Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z5FugEXhdKjhwcnP@nand.local \
--to=me@ttaylorr.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=johannes.schindelin@gmx.de \
--cc=johncai86@gmail.com \
--cc=jonathantanmy@google.com \
--cc=karthik.188@gmail.com \
--cc=newren@gmail.com \
--cc=peff@peff.net \
--cc=ps@pks.im \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).