[PATCH 0/4] pack-objects: create new name-hash algorithm

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] pack-objects: create new name-hash algorithm
@ 2024-09-09 13:56 Derrick Stolee via GitGitGadget
  2024-09-09 13:56 ` [PATCH 1/4] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
                   ` (6 more replies)
  0 siblings, 7 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-09 13:56 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee

I've been focused recently on understanding and mitigating the growth of a
few internal repositories. Some of these are growing much larger than
expected for the number of contributors, and there are multiple aspects to
why this growth is so large.

I will be submitting an RFC with my deep dive into the many aspects of this
issue, but the very last thing I discovered in this space is actually the
easiest change to make.

The main issue plaguing these repositories is that deltas are not being
computed against objects that appear at the same path. While the size of
these files at tip is one aspect of growth that would prevent this issue,
the changes to these files are reasonable and should result in good delta
compression. However, Git is not discovering the connections across
different versions of the same file.

One way to find some improvement in these repositories is to increase the
window size, which was an initial indicator that the delta compression could
be improved, but was not a clear indicator. After some digging (and
prototyping some analysis tools) the main discovery was that the current
name-hash algorithm only considers the last 16 characters in the path name
and has some naturally-occurring collisions within that scope.

This series introduces a new name-hash algorithm, but does not replace the
existing one. There are cases, such as packing a single snapshot of a
repository, where the existing algorithm outperforms the new one.

However, my findings show that when a repository has many versions of files
at the same path (and especially when there are many name-hash collisions)
then there are significant gains to be made using the new algorithm.

Repo Standard Repack With --full-name-hash fluentui 438 MB 168 MB Repo B
6,255 MB 829 MB Repo C 37,737 MB 7,125 MB Repo D 130,049 MB 6,190 MB

The main change in this series is in patch 1, which adds the algorithm and
the option to 'git pack-objects' and 'git repack'. The remaining patches are
focused on creating more evidence around the value of the new name-hash
algorithm and its effects on the packfiles created with it.

I will also try to make clear that I've been focused on client-side
performance and size concerns. I do not know if using this option will have
issues with advanced server-side repacking features, such as delta islands,
reachability bitmaps, or serving clones and fetches from the resulting
packfile. My educated guess is that the name-hash value does not affect
these features in any direct way, but I'll leave the testing of the server
scenarios to the experts.

Thanks, -Stolee

Derrick Stolee (4):
  pack-objects: add --full-name-hash option
  git-repack: update usage to match docs
  p5313: add size comparison test
  p5314: add a size test for name-hash collisions

 Documentation/git-pack-objects.txt |  3 +-
 Documentation/git-repack.txt       |  4 +-
 Makefile                           |  1 +
 builtin/pack-objects.c             | 20 ++++++---
 builtin/repack.c                   |  9 +++-
 pack-objects.h                     | 20 +++++++++
 t/helper/test-name-hash.c          | 23 ++++++++++
 t/helper/test-tool.c               |  1 +
 t/helper/test-tool.h               |  1 +
 t/perf/p5313-pack-objects.sh       | 71 ++++++++++++++++++++++++++++++
 t/perf/p5314-name-hash.sh          | 41 +++++++++++++++++
 t/t0450/txt-help-mismatches        |  1 -
 12 files changed, 186 insertions(+), 9 deletions(-)
 create mode 100644 t/helper/test-name-hash.c
 create mode 100755 t/perf/p5313-pack-objects.sh
 create mode 100755 t/perf/p5314-name-hash.sh

base-commit: 4c42d5ff284067fa32837421408bebfef996bf81
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1785%2Fderrickstolee%2Ffull-name-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1785/derrickstolee/full-name-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1785
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/4] pack-objects: add --full-name-hash option
  2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
@ 2024-09-09 13:56 ` Derrick Stolee via GitGitGadget
  2024-09-09 17:45   ` Junio C Hamano
  2024-09-09 13:56 ` [PATCH 2/4] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-09 13:56 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The pack_name_hash() method has not been materially changed since it was
introduced in ce0bd64299a (pack-objects: improve path grouping
heuristics., 2006-06-05). The intention here is to group objects by path
name, but also attempt to group similar file types together by making
the most-significant digits of the hash be focused on the final
characters.

Here's the crux of the implementation:

	/*
	 * This effectively just creates a sortable number from the
	 * last sixteen non-whitespace characters. Last characters
	 * count "most", so things that end in ".c" sort together.
	 */
	while ((c = *name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}

As the comment mentions, this only cares about the last sixteen
non-whitespace characters. This cause some filenames to collide more
than others. Here are some examples that I've seen while investigating
repositories that are growing more than they should be:

 * "/CHANGELOG.json" is 15 characters, and is created by the beachball
   [1] tool. Only the final character of the parent directory can
   differntiate different versions of this file, but also only the two
   most-significant digits. If that character is a letter, then this is
   always a collision. Similar issues occur with the similar
   "/CHANGELOG.md" path, though there is more opportunity for
   differences in the parent directory.

 * Localization files frequently have common filenames but differentiate
   via parent directories. In C#, the name "/strings.resx.lcl" is used
   for these localization files and they will all collide in name-hash.

[1] https://github.com/microsoft/beachball

I've come across many other examples where some internal tool uses a
common name across multiple directories and is causing Git to repack
poorly due to name-hash collisions.

It is clear that the existing name-hash algorithm is optimized for
repositories with short path names, but also is optimized for packing a
single snapshot of a repository, not a repository with many versions of
the same file. In my testing, this has proven out where the name-hash
algorithm does a good job of finding peer files as delta bases when
unable to use a historical version of that exact file.

However, for repositories that have many versions of most files and
directories, it is more important that the objects that appear at the
same path are grouped together.

Create a new pack_full_name_hash() method and a new --full-name-hash
option for 'git pack-objects' to call that method instead. Add a simple
pass-through for 'git repack --full-name-hash' for additional testing in
the context of a full repack, where I expect this will be most
effective.

The hash algorithm is as simple as possible to be reasonably effective:
for each character of the path string, add a multiple of that character
and a large prime number (chosen arbitrarily, but intended to be large
relative to the size of a uint32_t). Then, shift the current hash value
to the right by 5, with overlap. The addition and shift parameters are
standard mechanisms for creating hard-to-predict behaviors in the bits
of the resulting hash.

This is not meant to be cryptographic at all, but uniformly distributed
across the possible hash values. This creates a hash that appears
pseudorandom. There is no ability to consider similar file types as
being close to each other.

In a later change, a test-tool will be added so the effectiveness of
this hash can be demonstrated directly.

For now, let's consider how effective this mechanism is when repacking a
repository with and without the --full-name-hash option. Specifically,
let's use 'git repack -adf [--full-name-hash]' as our test.

On the Git repository, we do not expect much difference. All path names
are short. This is backed by our results:

| Stage                 | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone           | 260 MB    | N/A         |
| Standard Repack       | 127MB     | 106s        |
| With --full-name-hash | 126 MB    | 99s         |

This example demonstrates how there is some natural overhead coming from
the cloned copy because the server is hosting many forks and has not
optimized for exactly this set of reachable objects. But the full repack
has similar characteristics with and without --full-name-hash.

However, we can test this in a repository that uses one of the
problematic naming conventions above. The fluentui [2] repo uses
beachball to generate CHANGELOG.json and CHANGELOG.md files, and these
files have very poor delta characteristics when comparing against
versions across parent directories.

| Stage                 | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone           | 694 MB    | N/A         |
| Standard Repack       | 438 MB    | 728s        |
| With --full-name-hash | 168 MB    | 142s        |

[2] https://github.com/microsoft/fluentui

In this example, we see significant gains in the compressed packfile
size as well as the time taken to compute the packfile.

Using a collection of repositories that use the beachball tool, I was
able to make similar comparisions with dramatic results. While the
fluentui repo is public, the others are private so cannot be shared for
reproduction. The results are so significant that I find it important to
share here:

| Repo     | Standard Repack | With --full-name-hash |
|----------|-----------------|-----------------------|
| fluentui |         438 MB  |               168 MB  |
| Repo B   |       6,255 MB  |               829 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |

Future changes could include making --full-name-hash implied by a config
value or even implied by default during a full repack.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-pack-objects.txt |  3 ++-
 builtin/pack-objects.c             | 20 +++++++++++++++-----
 builtin/repack.c                   |  5 +++++
 pack-objects.h                     | 20 ++++++++++++++++++++
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index e32404c6aae..93861d9f85b 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -15,7 +15,8 @@ SYNOPSIS
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
 	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
-	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
+	[--shallow] [--keep-true-parents] [--[no-]sparse]
+	[--full-name-hash] < <object-list>

 DESCRIPTION
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 778be80f564..2a89666d89f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -266,6 +266,14 @@ struct configured_exclusion {
 static struct oidmap configured_exclusions;

 static struct oidset excluded_by_config;
+static int use_full_name_hash;
+
+static inline uint32_t pack_name_hash_fn(const char *name)
+{
+	if (use_full_name_hash)
+		return pack_full_name_hash(name);
+	return pack_name_hash(name);
+}

 /*
  * stats
@@ -1670,7 +1678,7 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
 		return 0;
 	}

-	create_object_entry(oid, type, pack_name_hash(name),
+	create_object_entry(oid, type, pack_name_hash_fn(name),
 			    exclude, name && no_try_delta(name),
 			    found_pack, found_offset);
 	return 1;
@@ -1884,7 +1892,7 @@ static void add_preferred_base_object(const char *name)
 {
 	struct pbase_tree *it;
 	size_t cmplen;
-	unsigned hash = pack_name_hash(name);
+	unsigned hash = pack_name_hash_fn(name);

 	if (!num_preferred_base || check_pbase_path(hash))
 		return;
@@ -3394,7 +3402,7 @@ static void show_object_pack_hint(struct object *object, const char *name,
 	 * here using a now in order to perhaps improve the delta selection
 	 * process.
 	 */
-	oe->hash = pack_name_hash(name);
+	oe->hash = pack_name_hash_fn(name);
 	oe->no_try_delta = name && no_try_delta(name);

 	stdin_packs_hints_nr++;
@@ -3544,7 +3552,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 	entry = packlist_find(&to_pack, oid);
 	if (entry) {
 		if (name) {
-			entry->hash = pack_name_hash(name);
+			entry->hash = pack_name_hash_fn(name);
 			entry->no_try_delta = no_try_delta(name);
 		}
 	} else {
@@ -3567,7 +3575,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 			return;
 		}

-		entry = create_object_entry(oid, type, pack_name_hash(name),
+		entry = create_object_entry(oid, type, pack_name_hash_fn(name),
 					    0, name && no_try_delta(name),
 					    pack, offset);
 	}
@@ -4398,6 +4406,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
 				N_("protocol"),
 				N_("exclude any configured uploadpack.blobpackfileuri with this protocol")),
+		OPT_BOOL(0, "full-name-hash", &use_full_name_hash,
+			 N_("optimize delta compression across identical path names over time")),
 		OPT_END(),
 	};

diff --git a/builtin/repack.c b/builtin/repack.c
index 8bb875532b4..87d0cd4d2f2 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -57,6 +57,7 @@ struct pack_objects_args {
 	int no_reuse_object;
 	int quiet;
 	int local;
+	int full_name_hash;
 	struct list_objects_filter_options filter_options;
 };

@@ -288,6 +289,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
 		strvec_pushf(&cmd->args, "--no-reuse-object");
+	if (args->full_name_hash)
+		strvec_pushf(&cmd->args, "--full-name-hash");
 	if (args->local)
 		strvec_push(&cmd->args,  "--local");
 	if (args->quiet)
@@ -1176,6 +1179,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("pass --no-reuse-delta to git-pack-objects")),
 		OPT_BOOL('F', NULL, &po_args.no_reuse_object,
 				N_("pass --no-reuse-object to git-pack-objects")),
+		OPT_BOOL(0, "full-name-hash", &po_args.full_name_hash,
+				N_("pass --full-name-hash to git-pack-objects")),
 		OPT_NEGBIT('n', NULL, &run_update_server_info,
 				N_("do not run git-update-server-info"), 1),
 		OPT__QUIET(&po_args.quiet, N_("be quiet")),
diff --git a/pack-objects.h b/pack-objects.h
index b9898a4e64b..50097552d03 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -207,6 +207,26 @@ static inline uint32_t pack_name_hash(const char *name)
 	return hash;
 }

+static inline uint32_t pack_full_name_hash(const char *name)
+{
+	const uint32_t bigp = 1234572167U;
+	uint32_t c, hash = bigp;
+
+	if (!name)
+		return 0;
+
+	/*
+	 * Just do the dumbest thing possible: add random multiples of a
+	 * large prime number with a binary shift. Goal is not cryptographic,
+	 * but generally uniformly distributed.
+	 */
+	while ((c = *name++) != 0) {
+		hash += c * bigp;
+		hash = (hash >> 5) | (hash << 27);
+	}
+	return hash;
+}
+
 static inline enum object_type oe_type(const struct object_entry *e)
 {
 	return e->type_valid ? e->type_ : OBJ_BAD;
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/4] git-repack: update usage to match docs
  2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
  2024-09-09 13:56 ` [PATCH 1/4] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
@ 2024-09-09 13:56 ` Derrick Stolee via GitGitGadget
  2024-09-09 17:48   ` Junio C Hamano
  2024-09-09 13:56 ` [PATCH 3/4] p5313: add size comparison test Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-09 13:56 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

This also adds the '--full-name-hash' option introduced in the previous
change and adds newlines to the synopsis.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-repack.txt | 4 +++-
 builtin/repack.c             | 4 +++-
 t/t0450/txt-help-mismatches  | 1 -
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index c902512a9e8..457a793fa89 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -9,7 +9,9 @@ git-repack - Pack unpacked objects in a repository
 SYNOPSIS
 --------
 [verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
+	[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
+	[--write-midx] [--full-name-hash]
 
 DESCRIPTION
 -----------
diff --git a/builtin/repack.c b/builtin/repack.c
index 87d0cd4d2f2..4fa2c25246d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -38,7 +38,9 @@ static int run_update_server_info = 1;
 static char *packdir, *packtmp_name, *packtmp;
 
 static const char *const git_repack_usage[] = {
-	N_("git repack [<options>]"),
+	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
+	   "[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
+	   "[--write-midx] [--full-name-hash]"),
 	NULL
 };
 
diff --git a/t/t0450/txt-help-mismatches b/t/t0450/txt-help-mismatches
index 28003f18c92..c4a15fd0cb8 100644
--- a/t/t0450/txt-help-mismatches
+++ b/t/t0450/txt-help-mismatches
@@ -45,7 +45,6 @@ rebase
 remote
 remote-ext
 remote-fd
-repack
 reset
 restore
 rev-parse
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/4] p5313: add size comparison test
  2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
  2024-09-09 13:56 ` [PATCH 1/4] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
  2024-09-09 13:56 ` [PATCH 2/4] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
@ 2024-09-09 13:56 ` Derrick Stolee via GitGitGadget
  2024-09-09 13:56 ` [PATCH 4/4] p5314: add a size test for name-hash collisions Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-09 13:56 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

As custom options are added to 'git pack-objects' and 'git repack' to
adjust how compression is done, use this new performance test script to
demonstrate their effectiveness in performance and size.

The recently-added --full-name-hash option swaps the default name-hash
algorithm with one that attempts to uniformly distribute the hashes
based on the full path name instead of the last 16 characters.

This has a dramatic effect on full repacks for repositories with many
versions of most paths. It can have a negative impact on cases such as
pushing a single change.

This can be seen by running pt5313 on the open source fluentui
repository [1]. Most commits will have this kind of output for the thin
and big pack cases, though certain commits (such as [2]) will have
problematic thin pack size for other reasons.

[1] https://github.com/microsoft/fluentui
[2] a637a06df05360ce5ff21420803f64608226a875

Checked out at the parent of [2], I see the following statistics:

Test                                           this tree
------------------------------------------------------------------
5313.2: thin pack                              0.02(0.01+0.01)
5313.3: thin pack size                                    1.1K
5313.4: thin pack with --full-name-hash        0.02(0.01+0.00)
5313.5: thin pack size with --full-name-hash              3.0K
5313.6: big pack                               1.65(3.35+0.24)
5313.7: big pack size                                    58.0M
5313.8: big pack with --full-name-hash         1.53(2.52+0.18)
5313.9: big pack size with --full-name-hash              57.6M
5313.10: repack                                176.52(706.60+3.53)
5313.11: repack size                                    446.7K
5313.12: repack with --full-name-hash          37.47(134.18+3.06)
5313.13: repack size with --full-name-hash              183.1K

Note that this demonstrates a 3x size _increase_ in the case that
simulates a small "git push". The size change is neutral on the case of
pushing the difference between HEAD and HEAD~1000.

However, the full repack case is both faster and more efficient.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 t/perf/p5313-pack-objects.sh | 71 ++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100755 t/perf/p5313-pack-objects.sh

diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
new file mode 100755
index 00000000000..88ea0a4698a
--- /dev/null
+++ b/t/perf/p5313-pack-objects.sh
@@ -0,0 +1,71 @@
+#!/bin/sh
+
+test_description='Tests pack performance using bitmaps'
+. ./perf-lib.sh
+
+GIT_TEST_PASSING_SANITIZE_LEAK=0
+export GIT_TEST_PASSING_SANITIZE_LEAK
+
+test_perf_large_repo
+
+test_expect_success 'create rev input' '
+	cat >in-thin <<-EOF &&
+	$(git rev-parse HEAD)
+	^$(git rev-parse HEAD~1)
+	EOF
+
+	cat >in-big <<-EOF
+	$(git rev-parse HEAD)
+	^$(git rev-parse HEAD~1000)
+	EOF
+'
+
+test_perf 'thin pack' '
+	git pack-objects --thin --stdout --revs --sparse  <in-thin >out
+'
+
+test_size 'thin pack size' '
+	wc -c <out
+'
+
+test_perf 'thin pack with --full-name-hash' '
+	git pack-objects --thin --stdout --revs --sparse --full-name-hash <in-thin >out
+'
+
+test_size 'thin pack size with --full-name-hash' '
+	wc -c <out
+'
+
+test_perf 'big pack' '
+	git pack-objects --stdout --revs --sparse  <in-big >out
+'
+
+test_size 'big pack size' '
+	wc -c <out
+'
+
+test_perf 'big pack with --full-name-hash' '
+	git pack-objects --stdout --revs --sparse --full-name-hash <in-big >out
+'
+
+test_size 'big pack size with --full-name-hash' '
+	wc -c <out
+'
+
+test_perf 'repack' '
+	git repack -adf
+'
+
+test_size 'repack size' '
+	du -a .git/objects/pack | sort -nr | awk "{ print \$1; }" | head -n 1
+'
+
+test_perf 'repack with --full-name-hash' '
+	git repack -adf --full-name-hash
+'
+
+test_size 'repack size with --full-name-hash' '
+	du -a .git/objects/pack | sort -nr | awk "{ print \$1; }" | head -n 1
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/4] p5314: add a size test for name-hash collisions
  2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2024-09-09 13:56 ` [PATCH 3/4] p5313: add size comparison test Derrick Stolee via GitGitGadget
@ 2024-09-09 13:56 ` Derrick Stolee via GitGitGadget
  2024-09-09 16:07 ` [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-09 13:56 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.

Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps inform
someone as to the behavior of the name-hash algorithms for their repo
based on the paths at HEAD.

My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                        4.5K
5314.2: number of distinct name-hashes                       4.1K
5314.3: number of distinct full-name-hashes                  4.5K
5314.4: maximum multiplicity of name-hashes                    13
5314.5: maximum multiplicity of fullname-hashes                 1

Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.

In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                       19.6K
5314.2: number of distinct name-hashes                       8.2K
5314.3: number of distinct full-name-hashes                 19.6K
5314.4: maximum multiplicity of name-hashes                   279
5314.5: maximum multiplicity of fullname-hashes                 1

[1] https://github.com/microsoft/fluentui

That demonstrates that of the nearly twenty thousand path names, they
are assigned around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.

In this repository, no collisions occur for the full-name-hash
algorithm.

In a more extreme example, an internal monorepo had a much worse
collision rate:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                      221.6K
5314.2: number of distinct name-hashes                      72.0K
5314.3: number of distinct full-name-hashes                221.6K
5314.4: maximum multiplicity of name-hashes                 14.4K
5314.5: maximum multiplicity of fullname-hashes                 2

Even in this repository with many more paths at HEAD, the collision rate
was low and the maximum number of paths being grouped into a single
bucket by the full-path-name algorithm was two.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Makefile                  |  1 +
 t/helper/test-name-hash.c | 23 ++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 t/perf/p5314-name-hash.sh | 41 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 67 insertions(+)
 create mode 100644 t/helper/test-name-hash.c
 create mode 100755 t/perf/p5314-name-hash.sh

diff --git a/Makefile b/Makefile
index 91f65d7dc57..a5ce284cf4a 100644
--- a/Makefile
+++ b/Makefile
@@ -812,6 +812,7 @@ TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o
 TEST_BUILTINS_OBJS += test-match-trees.o
 TEST_BUILTINS_OBJS += test-mergesort.o
 TEST_BUILTINS_OBJS += test-mktemp.o
+TEST_BUILTINS_OBJS += test-name-hash.o
 TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-pack-mtimes.o
diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
new file mode 100644
index 00000000000..15fb8f853c1
--- /dev/null
+++ b/t/helper/test-name-hash.c
@@ -0,0 +1,23 @@
+/*
+ * test-name-hash.c: Read a list of paths over stdin and report on their
+ * name-hash and full name-hash.
+ */
+
+#include "test-tool.h"
+#include "git-compat-util.h"
+#include "pack-objects.h"
+#include "strbuf.h"
+
+int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
+{
+	struct strbuf line = STRBUF_INIT;
+
+	while (!strbuf_getline(&line, stdin)) {
+		uint32_t name_hash = pack_name_hash(line.buf);
+		uint32_t full_hash = pack_full_name_hash(line.buf);
+
+		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
+	}
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index f8a67df7de9..4a603921002 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -43,6 +43,7 @@ static struct test_cmd cmds[] = {
 	{ "match-trees", cmd__match_trees },
 	{ "mergesort", cmd__mergesort },
 	{ "mktemp", cmd__mktemp },
+	{ "name-hash", cmd__name_hash },
 	{ "oid-array", cmd__oid_array },
 	{ "online-cpus", cmd__online_cpus },
 	{ "pack-mtimes", cmd__pack_mtimes },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index e74bc0ffd41..56a83bf3aac 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -37,6 +37,7 @@ int cmd__lazy_init_name_hash(int argc, const char **argv);
 int cmd__match_trees(int argc, const char **argv);
 int cmd__mergesort(int argc, const char **argv);
 int cmd__mktemp(int argc, const char **argv);
+int cmd__name_hash(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
 int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
diff --git a/t/perf/p5314-name-hash.sh b/t/perf/p5314-name-hash.sh
new file mode 100755
index 00000000000..9fe26612fac
--- /dev/null
+++ b/t/perf/p5314-name-hash.sh
@@ -0,0 +1,41 @@
+#!/bin/sh
+
+test_description='Tests pack performance using bitmaps'
+. ./perf-lib.sh
+
+GIT_TEST_PASSING_SANITIZE_LEAK=0
+export GIT_TEST_PASSING_SANITIZE_LEAK
+
+test_perf_large_repo
+
+test_size 'paths at head' '
+	git ls-tree -r --name-only HEAD >path-list &&
+	wc -l <path-list
+'
+
+test_size 'number of distinct name-hashes' '
+	cat path-list | test-tool name-hash >name-hashes &&
+	cat name-hashes | awk "{ print \$1; }" | sort -n | uniq -c >name-hash-count &&
+	wc -l <name-hash-count
+'
+
+test_size 'number of distinct full-name-hashes' '
+	cat name-hashes | awk "{ print \$2; }" | sort -n | uniq -c >full-name-hash-count &&
+	wc -l <full-name-hash-count
+'
+
+test_size 'maximum multiplicity of name-hashes' '
+	cat name-hash-count | \
+		sort -nr | \
+		head -n 1 | \
+		awk "{ print \$1; }"
+'
+
+test_size 'maximum multiplicity of fullname-hashes' '
+	cat full-name-hash-count | \
+		sort -nr | \
+		head -n 1 | \
+		awk "{ print \$1; }"
+'
+
+test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
  2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2024-09-09 13:56 ` [PATCH 4/4] p5314: add a size test for name-hash collisions Derrick Stolee via GitGitGadget
@ 2024-09-09 16:07 ` Derrick Stolee
  2024-09-09 18:06 ` Junio C Hamano
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2024-09-09 16:07 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren

On 9/9/24 9:56 AM, Derrick Stolee via GitGitGadget wrote:

> However, my findings show that when a repository has many versions of files
> at the same path (and especially when there are many name-hash collisions)
> then there are significant gains to be made using the new algorithm.

Of course this table didn't render correctly. Here's a readable version:

| Repo     | Standard Repack | With --full-name-hash |
|----------|-----------------|-----------------------|
| fluentui |         438 MB  |               168 MB  |
| Repo B   |       6,255 MB  |               829 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] pack-objects: add --full-name-hash option
  2024-09-09 13:56 ` [PATCH 1/4] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
@ 2024-09-09 17:45   ` Junio C Hamano
  2024-09-10  2:31     ` Derrick Stolee
  0 siblings, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2024-09-09 17:45 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This is not meant to be cryptographic at all, but uniformly distributed
> across the possible hash values. This creates a hash that appears
> pseudorandom. There is no ability to consider similar file types as
> being close to each other.

Another consideration we had when designing the current mechanism,
which is more important than "compare .c files with each other", is
to handle the case where a file is moved across directory boundary
without changing its name.  These "hash collissions" are meant to be
a part of obtaining _good_ paring of blobs that ought to be similar
to each other.  In other words, we wanted them to collide so that we
do not have to be negatively affected by moves.

I am not saying that we should not update the pack name hash; I am
just saying that "consider similar file types" as if that is the
most important aspect of the current hash, is misleading.

Thanks.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/4] git-repack: update usage to match docs
  2024-09-09 13:56 ` [PATCH 2/4] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
@ 2024-09-09 17:48   ` Junio C Hamano
  0 siblings, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2024-09-09 17:48 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/t/t0450/txt-help-mismatches b/t/t0450/txt-help-mismatches
> index 28003f18c92..c4a15fd0cb8 100644
> --- a/t/t0450/txt-help-mismatches
> +++ b/t/t0450/txt-help-mismatches
> @@ -45,7 +45,6 @@ rebase
>  remote
>  remote-ext
>  remote-fd
> -repack
>  reset
>  restore
>  rev-parse

This is very much appreciated ;-)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
  2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2024-09-09 16:07 ` [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee
@ 2024-09-09 18:06 ` Junio C Hamano
  2024-09-10  2:37   ` Derrick Stolee
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
  6 siblings, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2024-09-09 18:06 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> One way to find some improvement in these repositories is to increase the
> window size, which was an initial indicator that the delta compression could
> be improved, but was not a clear indicator. After some digging (and
> prototyping some analysis tools) the main discovery was that the current
> name-hash algorithm only considers the last 16 characters in the path name
> and has some naturally-occurring collisions within that scope.

Yes, as I explained in the other message, this "collision" is an
integral part of the design to allow us gather candidates together
that may yield good deltas among them.  In addition, header files
whose names end with ".h" tend to share a bit comment at the
beginning of them in many projects, and the proximity (not
"collision") of the hash value is used to make them delta candidates
with each other.

I do agree that considering files at the same path from different
(but close-by) revisions as the prime candidates is very important,
but if you spread the "collissions" very thin by using "uniform
distribution", I am afraid that you'd end up discarding anything but
the blobs at the same path, which may go too far.  Having name hash
value that are close by no longer has any meaning in such a system.

I hope you can find a solution that strikes a good balance at the
end of the series (I saw only the first step), but I suspect an easy
way to avoid the downsides you observed is to use both.  Compare
with a handful of blobs taken from nearby commits (the original
object order is roughly in traversal order, and you can take
advantage of that fact) from exactly the same path (using your
"uniform distribution") before comparing with the blobs with close
value (of the current function) like the current implementation
does, may go a long way.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] pack-objects: add --full-name-hash option
  2024-09-09 17:45   ` Junio C Hamano
@ 2024-09-10  2:31     ` Derrick Stolee
  0 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2024-09-10  2:31 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren

On 9/9/24 1:45 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> This is not meant to be cryptographic at all, but uniformly distributed
>> across the possible hash values. This creates a hash that appears
>> pseudorandom. There is no ability to consider similar file types as
>> being close to each other.
> 
> Another consideration we had when designing the current mechanism,
> which is more important than "compare .c files with each other", is
> to handle the case where a file is moved across directory boundary
> without changing its name.  These "hash collissions" are meant to be
> a part of obtaining _good_ paring of blobs that ought to be similar
> to each other.  In other words, we wanted them to collide so that we
> do not have to be negatively affected by moves.
> 
> I am not saying that we should not update the pack name hash; I am
> just saying that "consider similar file types" as if that is the
> most important aspect of the current hash, is misleading.

Thank you for this extra aspect, which has clarified some of my
thinking and I will use in future versions.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
  2024-09-09 18:06 ` Junio C Hamano
@ 2024-09-10  2:37   ` Derrick Stolee
  2024-09-10 14:56     ` Taylor Blau
  2024-09-10 16:07     ` Junio C Hamano
  0 siblings, 2 replies; 38+ messages in thread
From: Derrick Stolee @ 2024-09-10  2:37 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren

On 9/9/24 2:06 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> One way to find some improvement in these repositories is to increase the
>> window size, which was an initial indicator that the delta compression could
>> be improved, but was not a clear indicator. After some digging (and
>> prototyping some analysis tools) the main discovery was that the current
>> name-hash algorithm only considers the last 16 characters in the path name
>> and has some naturally-occurring collisions within that scope.
> 
> Yes, as I explained in the other message, this "collision" is an
> integral part of the design to allow us gather candidates together
> that may yield good deltas among them.  In addition, header files
> whose names end with ".h" tend to share a bit comment at the
> beginning of them in many projects, and the proximity (not
> "collision") of the hash value is used to make them delta candidates
> with each other.
> 
> I do agree that considering files at the same path from different
> (but close-by) revisions as the prime candidates is very important,
> but if you spread the "collissions" very thin by using "uniform
> distribution", I am afraid that you'd end up discarding anything but
> the blobs at the same path, which may go too far.  Having name hash
> value that are close by no longer has any meaning in such a system.

You are right that some "nearby" paths are lost in this change, and
this can be measured by trying to use this option in the pack-objects
process underneath a small 'git push'.

The thing that surprised me is just how effective this is for the
creation of large pack-files that include many versions of most
files. The cross-path deltas have less of an effect here, and the
benefits of avoiding name-hash collisions can be overwhelming in
many cases.

> I hope you can find a solution that strikes a good balance at the
> end of the series (I saw only the first step), but I suspect an easy
> way to avoid the downsides you observed is to use both.  Compare
> with a handful of blobs taken from nearby commits (the original
> object order is roughly in traversal order, and you can take
> advantage of that fact) from exactly the same path (using your
> "uniform distribution") before comparing with the blobs with close
> value (of the current function) like the current implementation
> does, may go a long way.

Funny you should say that, since the RFC I finally submitted [1]
actually does just that. The --path-walk option changes the object
walk to consider batches of objects based on their path, computes
deltas among that batch, and then does the normal name-hash pass
later. This seems to really strike the balance that you are
looking for and solves the issues where small pushes need to stay
small. It also fixes some problematic cases even when pushing a
single commit.

[1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

However, the --path-walk option requires significant implementation
of a "path walk API" and my RFC doesn't even do threading right.
The --path-walk version (probably) doesn't work with delta islands
or other features the same way as the drop-in change to the name-
hash heuristic can. For that reason, I think there is likely some
long-term value to the --full-name-hash option even though the
--path-walk option will be better in many cases.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
  2024-09-10  2:37   ` Derrick Stolee
@ 2024-09-10 14:56     ` Taylor Blau
  2024-09-10 21:05       ` Derrick Stolee
  2024-09-10 16:07     ` Junio C Hamano
  1 sibling, 1 reply; 38+ messages in thread
From: Taylor Blau @ 2024-09-10 14:56 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git,
	johannes.schindelin, peff, ps, johncai86, newren

On Mon, Sep 09, 2024 at 10:37:30PM -0400, Derrick Stolee wrote:
> > I do agree that considering files at the same path from different
> > (but close-by) revisions as the prime candidates is very important,
> > but if you spread the "collissions" very thin by using "uniform
> > distribution", I am afraid that you'd end up discarding anything but
> > the blobs at the same path, which may go too far.  Having name hash
> > value that are close by no longer has any meaning in such a system.
>
> You are right that some "nearby" paths are lost in this change, and
> this can be measured by trying to use this option in the pack-objects
> process underneath a small 'git push'.
>
> The thing that surprised me is just how effective this is for the
> creation of large pack-files that include many versions of most
> files. The cross-path deltas have less of an effect here, and the
> benefits of avoiding name-hash collisions can be overwhelming in
> many cases.

I think that Junio's suggestion is pretty interesting (though please
take my comments here with a grain of salt, since I haven't read the
other series yet, and am not sure how much of this is redundant).

Imagine computing both the full and existing name-hash values for each
blob/tree in the pack. Then objects would be sorted in the delta
selection window by similar full-name hash and similar regular name hash
values.

I'm not sure which value you'd actually record in the pack, though.
Ideally there is a hash function which captures some information about
the full path as well as the final path component, so we could use a
single value here, though I suspect the implementation would be more
complicated than what is presented here.

> > I hope you can find a solution that strikes a good balance at the
> > end of the series (I saw only the first step), but I suspect an easy
> > way to avoid the downsides you observed is to use both.  Compare
> > with a handful of blobs taken from nearby commits (the original
> > object order is roughly in traversal order, and you can take
> > advantage of that fact) from exactly the same path (using your
> > "uniform distribution") before comparing with the blobs with close
> > value (of the current function) like the current implementation
> > does, may go a long way.
>
> Funny you should say that, since the RFC I finally submitted [1]
> actually does just that. The --path-walk option changes the object
> walk to consider batches of objects based on their path, computes
> deltas among that batch, and then does the normal name-hash pass
> later. This seems to really strike the balance that you are
> looking for and solves the issues where small pushes need to stay
> small. It also fixes some problematic cases even when pushing a
> single commit.

Interesting.

> [1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

> However, the --path-walk option requires significant implementation
> of a "path walk API" and my RFC doesn't even do threading right.
> The --path-walk version (probably) doesn't work with delta islands
> or other features the same way as the drop-in change to the name-
> hash heuristic can. For that reason, I think there is likely some
> long-term value to the --full-name-hash option even though the
> --path-walk option will be better in many cases.

I suspect that this is going to be a significant sticking point. Not
supporting multi-threading is work-able for GitHub (since we set
pack.threads=1 today), but lacking support for delta-islands makes this
a non-starter to run at GitHub.

Do you imagine that the --path-walk option could be made to work with
delta islands? I'm not all that worried about who does that work, but
more interested at this point in whether or not it's even possible to
implement.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
  2024-09-10  2:37   ` Derrick Stolee
  2024-09-10 14:56     ` Taylor Blau
@ 2024-09-10 16:07     ` Junio C Hamano
  2024-09-10 20:36       ` Junio C Hamano
  1 sibling, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2024-09-10 16:07 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, peff,
	ps, me, johncai86, newren

Derrick Stolee <stolee@gmail.com> writes:

> The thing that surprised me is just how effective this is for the
> creation of large pack-files that include many versions of most
> files. The cross-path deltas have less of an effect here, and the
> benefits of avoiding name-hash collisions can be overwhelming in
> many cases.

Yes, "make sure we notice a file F moving from directory A to B" is
inherently optimized for short span of history, i.e. a smallish push
rather than a whole history clone, where the definition of
"smallish" is that even if you create optimal delta chains, the
length of these delta chains will not exceed the "--depth" option.

If the history you are pushing modified A/F twice, renamed it to B/F
(with or without modification at the same time), then modified B/F
twice more, you'd want to pack the 5-commit segment and having to
artificially cut the delta chain that can contain all of these 5
blobs into two at the renaming commit is a huge loss.

Compared to that, a whole history clone is a very different story,
as we will have to chomp delta chains at some depth anyway.  Before
the rename, it is reasonable to assume that A/F have evolved
incrementally for number of revisions, and after that rename it is
expected B/F will evolve incrementally for number of revisions
before it gets renamed again.  It is just the matter of choosing
where in that long stretch of content evolution we would cut the
delta chain, and the commit that renamed the path may just be a
good, if not absolute optimal, point.

So I do agree that this is an important case to optimize for.  At
some point, even when taking only the blobs at the same path as
delta base candidates, your true best base may be outside of the
--window in the list of candidates (sorted by size in decreasing
order), but at that point you would be increasing window to find
better delta base, not to skip unrelated blobs that happened to have
thrown into the same hash bucket due to the design that optimizes
for different case, so we can say that it is worth spending the
extra cycle and memory, if you need a larger window to gain even
better packing result.

> Funny you should say that, since the RFC I finally submitted [1]
> actually does just that. The --path-walk option changes the object
> walk to consider batches of objects based on their path, computes
> deltas among that batch, and then does the normal name-hash pass
> later. This seems to really strike the balance that you are
> looking for and solves the issues where small pushes need to stay
> small. It also fixes some problematic cases even when pushing a
> single commit.

;-).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
  2024-09-10 16:07     ` Junio C Hamano
@ 2024-09-10 20:36       ` Junio C Hamano
  2024-09-10 21:09         ` Derrick Stolee
  0 siblings, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2024-09-10 20:36 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, peff,
	ps, me, johncai86, newren

Junio C Hamano <gitster@pobox.com> writes:

> Derrick Stolee <stolee@gmail.com> writes:
>
>> The thing that surprised me is just how effective this is for the
>> creation of large pack-files that include many versions of most
>> files. The cross-path deltas have less of an effect here, and the
>> benefits of avoiding name-hash collisions can be overwhelming in
>> many cases.
>
> Yes, "make sure we notice a file F moving from directory A to B" is
> inherently optimized for short span of history, i.e. a smallish push
> rather than a whole history clone, where the definition of
> "smallish" is that even if you create optimal delta chains, the
> length of these delta chains will not exceed the "--depth" option.
>
> If the history you are pushing modified A/F twice, renamed it to B/F
> (with or without modification at the same time), then modified B/F
> twice more, you'd want to pack the 5-commit segment and having to
> artificially cut the delta chain that can contain all of these 5
> blobs into two at the renaming commit is a huge loss.

Which actually leads me to suspect that we probably do not even have
to expose the --full-name-hash option to the end users in "git repack".

If we are doing incremental that would fit within the depth setting,
it is likely that we would be better off without the full-name-hash
optimization, and if we are doing "repack -a" for the whole
repository, especially with "-f", it would make sense to do the
full-name-hash optimization.

If we can tell how large a chunk of history we are packing before we
actually start calling builtin/pack-objects.c:add_object_entry(), we
probably should be able to even select between with and without
full-name-hash automatically, but I do not think we know the object
count before we finish calling add_object_entry(), so unless we are
willing to compute and keep both while reading and pick between the
two after we finish reading the list of objects, or something, it
will require a major surgery to do so, I am afraid.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
  2024-09-10 14:56     ` Taylor Blau
@ 2024-09-10 21:05       ` Derrick Stolee
  2024-09-11  6:32         ` Jeff King
  0 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee @ 2024-09-10 21:05 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git,
	johannes.schindelin, peff, ps, johncai86, newren

On 9/10/24 10:56 AM, Taylor Blau wrote:
> On Mon, Sep 09, 2024 at 10:37:30PM -0400, Derrick Stolee wrote:
>>> I do agree that considering files at the same path from different
>>> (but close-by) revisions as the prime candidates is very important,
>>> but if you spread the "collissions" very thin by using "uniform
>>> distribution", I am afraid that you'd end up discarding anything but
>>> the blobs at the same path, which may go too far.  Having name hash
>>> value that are close by no longer has any meaning in such a system.
>>
>> You are right that some "nearby" paths are lost in this change, and
>> this can be measured by trying to use this option in the pack-objects
>> process underneath a small 'git push'.
>>
>> The thing that surprised me is just how effective this is for the
>> creation of large pack-files that include many versions of most
>> files. The cross-path deltas have less of an effect here, and the
>> benefits of avoiding name-hash collisions can be overwhelming in
>> many cases.
> 
> I think that Junio's suggestion is pretty interesting (though please
> take my comments here with a grain of salt, since I haven't read the
> other series yet, and am not sure how much of this is redundant).
> 
> Imagine computing both the full and existing name-hash values for each
> blob/tree in the pack. Then objects would be sorted in the delta
> selection window by similar full-name hash and similar regular name hash
> values.
> 
> I'm not sure which value you'd actually record in the pack, though.
> Ideally there is a hash function which captures some information about
> the full path as well as the final path component, so we could use a
> single value here, though I suspect the implementation would be more
> complicated than what is presented here.

Is the name hash stored in the pack itself? I know that it is stored
in the 'struct object_entry' data in the packing data. While we could
add another uint32_t into that struct to store both hash values, this
would increase the memory requirements of repacking by four bytes per
object. The struct seemed to be very clear about trying as hard as
possible to avoid doing that.

But maybe an alternative could be replacing that 32-bit number with
an index into an array of paths that have their hash values stored
there.

This would still involve two passes, but might still be possible. I'll
think on this.

>>> I hope you can find a solution that strikes a good balance at the
>>> end of the series (I saw only the first step), but I suspect an easy
>>> way to avoid the downsides you observed is to use both.  Compare
>>> with a handful of blobs taken from nearby commits (the original
>>> object order is roughly in traversal order, and you can take
>>> advantage of that fact) from exactly the same path (using your
>>> "uniform distribution") before comparing with the blobs with close
>>> value (of the current function) like the current implementation
>>> does, may go a long way.
>>
>> Funny you should say that, since the RFC I finally submitted [1]
>> actually does just that. The --path-walk option changes the object
>> walk to consider batches of objects based on their path, computes
>> deltas among that batch, and then does the normal name-hash pass
>> later. This seems to really strike the balance that you are
>> looking for and solves the issues where small pushes need to stay
>> small. It also fixes some problematic cases even when pushing a
>> single commit.
> 
> Interesting.
> 
>> [1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
> 
>> However, the --path-walk option requires significant implementation
>> of a "path walk API" and my RFC doesn't even do threading right.
>> The --path-walk version (probably) doesn't work with delta islands
>> or other features the same way as the drop-in change to the name-
>> hash heuristic can. For that reason, I think there is likely some
>> long-term value to the --full-name-hash option even though the
>> --path-walk option will be better in many cases.
> 
> I suspect that this is going to be a significant sticking point. Not
> supporting multi-threading is work-able for GitHub (since we set
> pack.threads=1 today), but lacking support for delta-islands makes this
> a non-starter to run at GitHub.
> 
> Do you imagine that the --path-walk option could be made to work with
> delta islands? I'm not all that worried about who does that work, but
> more interested at this point in whether or not it's even possible to
> implement.
This is part of the reason why I think the --full-name-hash option is
an interesting consideration. It doesn't have any obvious reason why
it couldn't work with features like delta islands, so it may provide
some quick wins in "large enough" repositories, or at least "large in
the right way".

I unfortunately don't know enough about how the delta islands feature
works to be confident in the possibility of integrating it with the
--path-walk option. At minimum, it would require two object walks:
the first would mark the objects and the second would do the delta
compression with those markings in mind.

But if there is a way to combine both approaches with a two-pass
delta compression technique, then this may be all avoided. I'll give
it a try.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
  2024-09-10 20:36       ` Junio C Hamano
@ 2024-09-10 21:09         ` Derrick Stolee
  0 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2024-09-10 21:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, peff,
	ps, me, johncai86, newren

On 9/10/24 4:36 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> The thing that surprised me is just how effective this is for the
>>> creation of large pack-files that include many versions of most
>>> files. The cross-path deltas have less of an effect here, and the
>>> benefits of avoiding name-hash collisions can be overwhelming in
>>> many cases.
>>
>> Yes, "make sure we notice a file F moving from directory A to B" is
>> inherently optimized for short span of history, i.e. a smallish push
>> rather than a whole history clone, where the definition of
>> "smallish" is that even if you create optimal delta chains, the
>> length of these delta chains will not exceed the "--depth" option.
>>
>> If the history you are pushing modified A/F twice, renamed it to B/F
>> (with or without modification at the same time), then modified B/F
>> twice more, you'd want to pack the 5-commit segment and having to
>> artificially cut the delta chain that can contain all of these 5
>> blobs into two at the renaming commit is a huge loss.
> 
> Which actually leads me to suspect that we probably do not even have
> to expose the --full-name-hash option to the end users in "git repack".
> 
> If we are doing incremental that would fit within the depth setting,
> it is likely that we would be better off without the full-name-hash
> optimization, and if we are doing "repack -a" for the whole
> repository, especially with "-f", it would make sense to do the
> full-name-hash optimization.

Depending on how much we learn from others testing the --full-name-hash
option, I could see the potential that -a could imply --full-name-hash.
I hesitate to introduce that in the first release with this option,
though.

> If we can tell how large a chunk of history we are packing before we
> actually start calling builtin/pack-objects.c:add_object_entry(), we
> probably should be able to even select between with and without
> full-name-hash automatically, but I do not think we know the object
> count before we finish calling add_object_entry(), so unless we are
> willing to compute and keep both while reading and pick between the
> two after we finish reading the list of objects, or something, it
> will require a major surgery to do so, I am afraid.

It's also possible that we could check the list of paths at HEAD to
see how many collisions the default name-hash gives. In cases like
the Git repository, there are very few collisions and thus we don't
need to use --full-name-hash. Restricting to just HEAD (or the
default ref) is not a complete analysis, but might be a good
heuristic.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
  2024-09-10 21:05       ` Derrick Stolee
@ 2024-09-11  6:32         ` Jeff King
  0 siblings, 0 replies; 38+ messages in thread
From: Jeff King @ 2024-09-11  6:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Taylor Blau, Junio C Hamano, Derrick Stolee via GitGitGadget, git,
	johannes.schindelin, ps, johncai86, newren

On Tue, Sep 10, 2024 at 05:05:09PM -0400, Derrick Stolee wrote:

> > I'm not sure which value you'd actually record in the pack, though.
> > Ideally there is a hash function which captures some information about
> > the full path as well as the final path component, so we could use a
> > single value here, though I suspect the implementation would be more
> > complicated than what is presented here.
> 
> Is the name hash stored in the pack itself? I know that it is stored
> in the 'struct object_entry' data in the packing data. While we could
> add another uint32_t into that struct to store both hash values, this
> would increase the memory requirements of repacking by four bytes per
> object. The struct seemed to be very clear about trying as hard as
> possible to avoid doing that.

It's stored in the .bitmap files, since otherwise a pack-objects which
uses bitmaps to serve a fetch would have no clue of their path names.
See the "HASH_CACHE" bitmap extension.

You generally don't want to make deltas out of two entries in the bitmap
(they're already in the same pack, so we'd usually skip them), but you
do want to consider making on-the-fly deltas against other objects.

I guess we may also consider deltas between objects in two packs that
are both covered by the same midx bitmap.

> But maybe an alternative could be replacing that 32-bit number with
> an index into an array of paths that have their hash values stored
> there.

Yes, that would work, though how big is that path array going to be?
Uncompressed linux.git is probably 3-4MB, which actually doesn't sound
_too_ bad. You could obviously go a long way with prefix compression,
too.

But if I understand the proposal, it is just replacing one 32-bit hash
with another. You could just store that in the bitmap instead (or if the
direction is to use both, introduce a new extension to store both).
Obviously you'll get lousy results if the bitmap reader does not use the
same algorithm for its non-bitmap objects, but I don't think this is
something you'd be flipping back and forth on.

> This is part of the reason why I think the --full-name-hash option is
> an interesting consideration. It doesn't have any obvious reason why
> it couldn't work with features like delta islands, so it may provide
> some quick wins in "large enough" repositories, or at least "large in
> the right way".
> 
> I unfortunately don't know enough about how the delta islands feature
> works to be confident in the possibility of integrating it with the
> --path-walk option. At minimum, it would require two object walks:
> the first would mark the objects and the second would do the delta
> compression with those markings in mind.

The delta islands code already does its own tree walk to propagate the
bits down (it does rely on the base walk's show_commit() to propagate
through the commits).

Once each object has its island bitmaps, I think however you choose to
come up with delta candidates (whether the current type/size/namehash
sorted list, or some path walking), you should be able to use it. It's
fundamentally just answering the question of "am I allowed to delta
between these two objects".

Of course the devil may be in the details. ;)

-Peff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 0/6] pack-objects: create new name-hash algorithm
  2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2024-09-09 18:06 ` Junio C Hamano
@ 2024-09-18 20:46 ` Derrick Stolee via GitGitGadget
  2024-09-18 20:46   ` [PATCH v2 1/6] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
                     ` (8 more replies)
  6 siblings, 9 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-18 20:46 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee

I've been focused recently on understanding and mitigating the growth of a
few internal repositories. Some of these are growing much larger than
expected for the number of contributors, and there are multiple aspects to
why this growth is so large.

This is part of the RFC I submitted [1] [2] involving the path-walk API,
though this doesn't use the path-walk API directly. In full repack cases, it
seems that the --full-name-hash option gets nearly as good compression as
the --path-walk option introduced in that series. I continue to work on that
feature as well, so we can review it after this series is complete.

[1] https://github.com/gitgitgadget/git/pull/1786

[2]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

The main issue plaguing these repositories is that deltas are not being
computed against objects that appear at the same path. While the size of
these files at tip is one aspect of growth that would prevent this issue,
the changes to these files are reasonable and should result in good delta
compression. However, Git is not discovering the connections across
different versions of the same file.

One way to find some improvement in these repositories is to increase the
window size, which was an initial indicator that the delta compression could
be improved, but was not a clear indicator. After some digging (and
prototyping some analysis tools) the main discovery was that the current
name-hash algorithm only considers the last 16 characters in the path name
and has some naturally-occurring collisions within that scope.

This series introduces a new name-hash algorithm, but does not replace the
existing one. There are cases, such as packing a single snapshot of a
repository, where the existing algorithm outperforms the new one.

However, my findings show that when a repository has many versions of files
at the same path (and especially when there are many name-hash collisions)
then there are significant gains to be made using the new algorithm.

(This table is updated in v2 with even more private examples that were found
while communicating findings internally.)

| Repo     | Standard Repack | With --full-name-hash |
|----------|-----------------|-----------------------|
| fluentui |         438 MB  |               168 MB  |
| Repo B   |       6,255 MB  |               829 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |
| Repo E   |     100,957 MB  |            22,979 MB  |
| Repo F   |       8,308 MB  |               746 MB  |
| Repo G   |       4,329 MB  |             3,643 MB  |

I include Repo G here as an example where the improvement is less drastic,
since this repo does not demonstrate a very high rate of name-hash
collisions; the collisions that exist seem to be in paths that are not
changed very often. Thus, the standard name-hash algorithm is nearly as
effective in these full repacks.

The main change in this series is in patch 1, which adds the algorithm and
the option to 'git pack-objects' and 'git repack'. The remaining patches are
focused on creating more evidence around the value of the new name-hash
algorithm and its effects on the packfiles created with it.

I will also try to make clear that I've been focused on client-side
performance and size concerns. Based on discussions in v1, it appears that
the following is true:

 * This feature is completely orthogonal to delta islands.
 * Changing the name-hash function can lead to compatibility issues with
   .bitmap files, as they store a name-hash value. Without augmenting the
   data structure to indicate which name-hash value was used at write time,
   the full-name-hash values should not be stored in the .bitmap files or
   used when reading .bitmap files and other objects. Thus, the
   full-name-hash is marked as incompatible with bitmaps for now.
 * The --path-walk option from the RFC ([1] and [2]) is likely incompatible
   with the delta-islands feature without significant work, so this suggests
   that the --full-name-hash is a better long-term solution for servers.
   This would still require the .bitmap modifications to make it work, but
   it's a smaller leap to get there.

Thanks, -Stolee

UPDATES IN V2
=============

Thank you for all of the discussion on v1. Here are the things I've learned
and how they have changed this patch series:

 * The test-tool name-hash helper change collides with the removal of
   test-tool oid-array, so I have rebased this series onto the latest master
   branch.
 * I learned (thanks, Taylor and Peff) that the .bitmap files store
   name-hash values. This means that the --full-name-hash option risks
   having different name-hash functions across bitmap reads and dynamic
   computations from the object walk. For this reason, the option is now
   made explicit to not work with bitmap walks. This could be corrected in
   the future with a modification to the .bitmap data structure to store a
   "name-hash version" value. This behavior is confirmed with a test.
 * A new test explicitly tests that git repack --full-name-hash passes the
   option to git pack-objects. This uses a new helper method,
   test_subcommand_flex that is more flexible than the existing
   test_subcommand.
 * To get more testing across different cases, add a new
   GIT_TEST_FULL_NAME_HASH environment variable. I point out which tests
   need modification when this is enabled.
 * The size-comparison tests were not portable with their use of du.
   Instead, use wc -c for the single pack-file that remains after a git
   repack -adf ... command.
 * The final patch still updates the test-tool name-hash helper, which was
   previously only used by a performance test. Now, use that test in a
   regular test to help guarantee that the functions do not change over
   time. This is directly related to the fact that these values can be
   stored in the .bitmap files and we need stable hash functions to keep
   compatibility with files written by previous versions of Git.

Other things that have happened include investigations into ways to adapt
the full-name hash to improve upon the name-hash. I did some experimenting
with increasing the size of 'struct object_entry' by using a 64-bit hash
value (name-hash, then full-name-hash) for a single-pass compression or two
32-bit hash values for a two-pass compression process. I include my WIP
branch at [3] to show what I tried, though the single-pass option did not
present any improvements and the two-pass option seems to be broken to the
point that the compression is substantially worse. (I'll try to elaborate on
this in a reply to this cover letter.)

[3]
https://github.com/derrickstolee/git/compare/full-name...derrickstolee:git:full-name-wip

Derrick Stolee (6):
  pack-objects: add --full-name-hash option
  repack: test --full-name-hash option
  pack-objects: add GIT_TEST_FULL_NAME_HASH
  git-repack: update usage to match docs
  p5313: add size comparison test
  test-tool: add helper for name-hash values

 Documentation/git-pack-objects.txt |  3 +-
 Documentation/git-repack.txt       |  4 +-
 Makefile                           |  1 +
 builtin/pack-objects.c             | 25 ++++++++---
 builtin/repack.c                   |  9 +++-
 ci/run-build-and-tests.sh          |  1 +
 pack-objects.h                     | 21 +++++++++
 t/README                           |  4 ++
 t/helper/test-name-hash.c          | 23 ++++++++++
 t/helper/test-tool.c               |  1 +
 t/helper/test-tool.h               |  1 +
 t/perf/p5313-pack-objects.sh       | 71 ++++++++++++++++++++++++++++++
 t/perf/p5314-name-hash.sh          | 41 +++++++++++++++++
 t/t0450/txt-help-mismatches        |  1 -
 t/t5300-pack-object.sh             | 15 +++++++
 t/t5310-pack-bitmaps.sh            | 26 +++++++++++
 t/t5510-fetch.sh                   |  7 ++-
 t/t5616-partial-clone.sh           | 26 ++++++++++-
 t/t6020-bundle-misc.sh             |  6 ++-
 t/t7700-repack.sh                  |  7 +++
 t/test-lib-functions.sh            | 27 ++++++++++++
 21 files changed, 307 insertions(+), 13 deletions(-)
 create mode 100644 t/helper/test-name-hash.c
 create mode 100755 t/perf/p5313-pack-objects.sh
 create mode 100755 t/perf/p5314-name-hash.sh

base-commit: 3fb745257b30a643ee78c9a7c52ab107c82e4745
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1785%2Fderrickstolee%2Ffull-name-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1785/derrickstolee/full-name-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1785

Range-diff vs v1:

 1:  ff30f774ca8 ! 1:  5f52e261765 pack-objects: add --full-name-hash option
     @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
       		OPT_END(),
       	};

     +@@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const char *prefix)
     + 	if (pack_to_stdout || !rev_list_all)
     + 		write_bitmap_index = 0;
     + 
     ++	if (write_bitmap_index && use_full_name_hash)
     ++		die(_("currently, the --full-name-hash option is incompatible with --write-bitmap-index"));
     ++
     + 	if (use_delta_islands)
     + 		strvec_push(&rp, "--topo-order");
     + 

       ## builtin/repack.c ##
      @@ builtin/repack.c: struct pack_objects_args {
     @@ pack-objects.h: static inline uint32_t pack_name_hash(const char *name)
      +		return 0;
      +
      +	/*
     -+	 * Just do the dumbest thing possible: add random multiples of a
     -+	 * large prime number with a binary shift. Goal is not cryptographic,
     -+	 * but generally uniformly distributed.
     ++	 * Do the simplest thing that will resemble pseduo-randomness: add
     ++	 * random multiples of a large prime number with a binary shift.
     ++	 * The goal is not to be cryptographic, but to be generally
     ++	 * uniformly distributed.
      +	 */
      +	while ((c = *name++) != 0) {
      +		hash += c * bigp;
     @@ pack-objects.h: static inline uint32_t pack_name_hash(const char *name)
       static inline enum object_type oe_type(const struct object_entry *e)
       {
       	return e->type_valid ? e->type_ : OBJ_BAD;
     +
     + ## t/t5300-pack-object.sh ##
     +@@ t/t5300-pack-object.sh: do
     + 	'
     + done
     + 
     ++# The following test is not necessarily a permanent choice, but since we do not
     ++# have a "name hash version" bit in the .bitmap file format, we cannot write the
     ++# full-name hash values into the .bitmap file without risking breakage later.
     ++#
     ++# TODO: Make these compatible in the future and replace this test with the
     ++# expected behavior when both are specified.
     ++test_expect_success '--full-name-hash and --write-bitmap-index are incompatible' '
     ++	test_must_fail git pack-objects base --all \
     ++		--full-name-hash --write-bitmap-index 2>err &&
     ++	grep incompatible err &&
     ++
     ++	# --stdout option silently removes --write-bitmap-index
     ++	git pack-objects --stdout --all --full-name-hash --write-bitmap-index >out
     ++'
     ++
     + test_done
 -:  ----------- > 2:  48b87cccedb repack: test --full-name-hash option
 -:  ----------- > 3:  48b3876a102 pack-objects: add GIT_TEST_FULL_NAME_HASH
 2:  eb0aa5f4e94 = 4:  c5827126120 git-repack: update usage to match docs
 3:  ed3ea4281a9 ! 5:  999b1d09424 p5313: add size comparison test
     @@ t/perf/p5313-pack-objects.sh (new)
      +'
      +
      +test_size 'repack size' '
     -+	du -a .git/objects/pack | sort -nr | awk "{ print \$1; }" | head -n 1
     ++	wc -c <.git/objects/pack/pack-*.pack
      +'
      +
      +test_perf 'repack with --full-name-hash' '
     @@ t/perf/p5313-pack-objects.sh (new)
      +'
      +
      +test_size 'repack size with --full-name-hash' '
     -+	du -a .git/objects/pack | sort -nr | awk "{ print \$1; }" | head -n 1
     ++	wc -c <.git/objects/pack/pack-*.pack
      +'
      +
      +test_done
 4:  5bcbcce6665 ! 6:  7e47fc8cb53 p5314: add a size test for name-hash collisions
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>

       ## Commit message ##
     -    p5314: add a size test for name-hash collisions
     +    test-tool: add helper for name-hash values

          Add a new test-tool helper, name-hash, to output the value of the
          name-hash algorithms for the input list of strings, one per line.

     +    Since the name-hash values can be stored in the .bitmap files, it is
     +    important that these hash functions do not change across Git versions.
     +    Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
     +    the current values. Due to how these functions are implemented, it would
     +    be difficult to change them without disturbing these values.
     +
          Create a performance test that uses test_size to demonstrate how
          collisions occur for these hash algorithms. This test helps inform
          someone as to the behavior of the name-hash algorithms for their repo
     @@ Makefile: TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o
       TEST_BUILTINS_OBJS += test-mergesort.o
       TEST_BUILTINS_OBJS += test-mktemp.o
      +TEST_BUILTINS_OBJS += test-name-hash.o
     - TEST_BUILTINS_OBJS += test-oid-array.o
       TEST_BUILTINS_OBJS += test-online-cpus.o
       TEST_BUILTINS_OBJS += test-pack-mtimes.o
     + TEST_BUILTINS_OBJS += test-parse-options.o

       ## t/helper/test-name-hash.c (new) ##
      @@
     @@ t/helper/test-tool.c: static struct test_cmd cmds[] = {
       	{ "mergesort", cmd__mergesort },
       	{ "mktemp", cmd__mktemp },
      +	{ "name-hash", cmd__name_hash },
     - 	{ "oid-array", cmd__oid_array },
       	{ "online-cpus", cmd__online_cpus },
       	{ "pack-mtimes", cmd__pack_mtimes },
     + 	{ "parse-options", cmd__parse_options },

       ## t/helper/test-tool.h ##
      @@ t/helper/test-tool.h: int cmd__lazy_init_name_hash(int argc, const char **argv);
     @@ t/perf/p5314-name-hash.sh (new)
      +'
      +
      +test_done
     +
     + ## t/t5310-pack-bitmaps.sh ##
     +@@ t/t5310-pack-bitmaps.sh: has_any () {
     + 	grep -Ff "$1" "$2"
     + }
     + 
     ++# Since name-hash values are stored in the .bitmap files, add a test
     ++# that checks that the name-hash calculations are stable across versions.
     ++# Not exhaustive, but these hashing algorithms would be hard to change
     ++# without causing deviations here.
     ++test_expect_success 'name-hash value stability' '
     ++	cat >names <<-\EOF &&
     ++	first
     ++	second
     ++	third
     ++	one-long-enough-for-collisions
     ++	two-long-enough-for-collisions
     ++	EOF
     ++
     ++	test-tool name-hash <names >out &&
     ++
     ++	cat >expect <<-\EOF &&
     ++	2582249472	3109209818	first
     ++	2289942528	3781118409	second
     ++	2300837888	3028707182	third
     ++	2544516325	3241327563	one-long-enough-for-collisions
     ++	2544516325	4207880830	two-long-enough-for-collisions
     ++	EOF
     ++
     ++	test_cmp expect out
     ++'
     ++
     + test_bitmap_cases () {
     + 	writeLookupTable=false
     + 	for i in "$@"

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 1/6] pack-objects: add --full-name-hash option
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
@ 2024-09-18 20:46   ` Derrick Stolee via GitGitGadget
  2024-09-19 21:56     ` Junio C Hamano
  2024-09-18 20:46   ` [PATCH v2 2/6] repack: test " Derrick Stolee via GitGitGadget
                     ` (7 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-18 20:46 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The pack_name_hash() method has not been materially changed since it was
introduced in ce0bd64299a (pack-objects: improve path grouping
heuristics., 2006-06-05). The intention here is to group objects by path
name, but also attempt to group similar file types together by making
the most-significant digits of the hash be focused on the final
characters.

Here's the crux of the implementation:

	/*
	 * This effectively just creates a sortable number from the
	 * last sixteen non-whitespace characters. Last characters
	 * count "most", so things that end in ".c" sort together.
	 */
	while ((c = *name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}

As the comment mentions, this only cares about the last sixteen
non-whitespace characters. This cause some filenames to collide more
than others. Here are some examples that I've seen while investigating
repositories that are growing more than they should be:

 * "/CHANGELOG.json" is 15 characters, and is created by the beachball
   [1] tool. Only the final character of the parent directory can
   differntiate different versions of this file, but also only the two
   most-significant digits. If that character is a letter, then this is
   always a collision. Similar issues occur with the similar
   "/CHANGELOG.md" path, though there is more opportunity for
   differences in the parent directory.

 * Localization files frequently have common filenames but differentiate
   via parent directories. In C#, the name "/strings.resx.lcl" is used
   for these localization files and they will all collide in name-hash.

[1] https://github.com/microsoft/beachball

I've come across many other examples where some internal tool uses a
common name across multiple directories and is causing Git to repack
poorly due to name-hash collisions.

It is clear that the existing name-hash algorithm is optimized for
repositories with short path names, but also is optimized for packing a
single snapshot of a repository, not a repository with many versions of
the same file. In my testing, this has proven out where the name-hash
algorithm does a good job of finding peer files as delta bases when
unable to use a historical version of that exact file.

However, for repositories that have many versions of most files and
directories, it is more important that the objects that appear at the
same path are grouped together.

Create a new pack_full_name_hash() method and a new --full-name-hash
option for 'git pack-objects' to call that method instead. Add a simple
pass-through for 'git repack --full-name-hash' for additional testing in
the context of a full repack, where I expect this will be most
effective.

The hash algorithm is as simple as possible to be reasonably effective:
for each character of the path string, add a multiple of that character
and a large prime number (chosen arbitrarily, but intended to be large
relative to the size of a uint32_t). Then, shift the current hash value
to the right by 5, with overlap. The addition and shift parameters are
standard mechanisms for creating hard-to-predict behaviors in the bits
of the resulting hash.

This is not meant to be cryptographic at all, but uniformly distributed
across the possible hash values. This creates a hash that appears
pseudorandom. There is no ability to consider similar file types as
being close to each other.

In a later change, a test-tool will be added so the effectiveness of
this hash can be demonstrated directly.

For now, let's consider how effective this mechanism is when repacking a
repository with and without the --full-name-hash option. Specifically,
let's use 'git repack -adf [--full-name-hash]' as our test.

On the Git repository, we do not expect much difference. All path names
are short. This is backed by our results:

| Stage                 | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone           | 260 MB    | N/A         |
| Standard Repack       | 127MB     | 106s        |
| With --full-name-hash | 126 MB    | 99s         |

This example demonstrates how there is some natural overhead coming from
the cloned copy because the server is hosting many forks and has not
optimized for exactly this set of reachable objects. But the full repack
has similar characteristics with and without --full-name-hash.

However, we can test this in a repository that uses one of the
problematic naming conventions above. The fluentui [2] repo uses
beachball to generate CHANGELOG.json and CHANGELOG.md files, and these
files have very poor delta characteristics when comparing against
versions across parent directories.

| Stage                 | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone           | 694 MB    | N/A         |
| Standard Repack       | 438 MB    | 728s        |
| With --full-name-hash | 168 MB    | 142s        |

[2] https://github.com/microsoft/fluentui

In this example, we see significant gains in the compressed packfile
size as well as the time taken to compute the packfile.

Using a collection of repositories that use the beachball tool, I was
able to make similar comparisions with dramatic results. While the
fluentui repo is public, the others are private so cannot be shared for
reproduction. The results are so significant that I find it important to
share here:

| Repo     | Standard Repack | With --full-name-hash |
|----------|-----------------|-----------------------|
| fluentui |         438 MB  |               168 MB  |
| Repo B   |       6,255 MB  |               829 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |

Future changes could include making --full-name-hash implied by a config
value or even implied by default during a full repack.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-pack-objects.txt |  3 ++-
 builtin/pack-objects.c             | 23 ++++++++++++++++++-----
 builtin/repack.c                   |  5 +++++
 pack-objects.h                     | 21 +++++++++++++++++++++
 t/t5300-pack-object.sh             | 15 +++++++++++++++
 5 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index e32404c6aae..93861d9f85b 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -15,7 +15,8 @@ SYNOPSIS
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
 	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
-	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
+	[--shallow] [--keep-true-parents] [--[no-]sparse]
+	[--full-name-hash] < <object-list>

 DESCRIPTION
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c6e2852d3ca..4a990d15609 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -266,6 +266,14 @@ struct configured_exclusion {
 static struct oidmap configured_exclusions;

 static struct oidset excluded_by_config;
+static int use_full_name_hash;
+
+static inline uint32_t pack_name_hash_fn(const char *name)
+{
+	if (use_full_name_hash)
+		return pack_full_name_hash(name);
+	return pack_name_hash(name);
+}

 /*
  * stats
@@ -1698,7 +1706,7 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
 		return 0;
 	}

-	create_object_entry(oid, type, pack_name_hash(name),
+	create_object_entry(oid, type, pack_name_hash_fn(name),
 			    exclude, name && no_try_delta(name),
 			    found_pack, found_offset);
 	return 1;
@@ -1912,7 +1920,7 @@ static void add_preferred_base_object(const char *name)
 {
 	struct pbase_tree *it;
 	size_t cmplen;
-	unsigned hash = pack_name_hash(name);
+	unsigned hash = pack_name_hash_fn(name);

 	if (!num_preferred_base || check_pbase_path(hash))
 		return;
@@ -3422,7 +3430,7 @@ static void show_object_pack_hint(struct object *object, const char *name,
 	 * here using a now in order to perhaps improve the delta selection
 	 * process.
 	 */
-	oe->hash = pack_name_hash(name);
+	oe->hash = pack_name_hash_fn(name);
 	oe->no_try_delta = name && no_try_delta(name);

 	stdin_packs_hints_nr++;
@@ -3572,7 +3580,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 	entry = packlist_find(&to_pack, oid);
 	if (entry) {
 		if (name) {
-			entry->hash = pack_name_hash(name);
+			entry->hash = pack_name_hash_fn(name);
 			entry->no_try_delta = no_try_delta(name);
 		}
 	} else {
@@ -3595,7 +3603,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 			return;
 		}

-		entry = create_object_entry(oid, type, pack_name_hash(name),
+		entry = create_object_entry(oid, type, pack_name_hash_fn(name),
 					    0, name && no_try_delta(name),
 					    pack, offset);
 	}
@@ -4426,6 +4434,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
 				N_("protocol"),
 				N_("exclude any configured uploadpack.blobpackfileuri with this protocol")),
+		OPT_BOOL(0, "full-name-hash", &use_full_name_hash,
+			 N_("optimize delta compression across identical path names over time")),
 		OPT_END(),
 	};

@@ -4573,6 +4583,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (pack_to_stdout || !rev_list_all)
 		write_bitmap_index = 0;

+	if (write_bitmap_index && use_full_name_hash)
+		die(_("currently, the --full-name-hash option is incompatible with --write-bitmap-index"));
+
 	if (use_delta_islands)
 		strvec_push(&rp, "--topo-order");

diff --git a/builtin/repack.c b/builtin/repack.c
index 8bb875532b4..87d0cd4d2f2 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -57,6 +57,7 @@ struct pack_objects_args {
 	int no_reuse_object;
 	int quiet;
 	int local;
+	int full_name_hash;
 	struct list_objects_filter_options filter_options;
 };

@@ -288,6 +289,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
 		strvec_pushf(&cmd->args, "--no-reuse-object");
+	if (args->full_name_hash)
+		strvec_pushf(&cmd->args, "--full-name-hash");
 	if (args->local)
 		strvec_push(&cmd->args,  "--local");
 	if (args->quiet)
@@ -1176,6 +1179,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("pass --no-reuse-delta to git-pack-objects")),
 		OPT_BOOL('F', NULL, &po_args.no_reuse_object,
 				N_("pass --no-reuse-object to git-pack-objects")),
+		OPT_BOOL(0, "full-name-hash", &po_args.full_name_hash,
+				N_("pass --full-name-hash to git-pack-objects")),
 		OPT_NEGBIT('n', NULL, &run_update_server_info,
 				N_("do not run git-update-server-info"), 1),
 		OPT__QUIET(&po_args.quiet, N_("be quiet")),
diff --git a/pack-objects.h b/pack-objects.h
index b9898a4e64b..b0a2d80854f 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -207,6 +207,27 @@ static inline uint32_t pack_name_hash(const char *name)
 	return hash;
 }

+static inline uint32_t pack_full_name_hash(const char *name)
+{
+	const uint32_t bigp = 1234572167U;
+	uint32_t c, hash = bigp;
+
+	if (!name)
+		return 0;
+
+	/*
+	 * Do the simplest thing that will resemble pseduo-randomness: add
+	 * random multiples of a large prime number with a binary shift.
+	 * The goal is not to be cryptographic, but to be generally
+	 * uniformly distributed.
+	 */
+	while ((c = *name++) != 0) {
+		hash += c * bigp;
+		hash = (hash >> 5) | (hash << 27);
+	}
+	return hash;
+}
+
 static inline enum object_type oe_type(const struct object_entry *e)
 {
 	return e->type_valid ? e->type_ : OBJ_BAD;
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 3b9dae331a5..fff49000c31 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -674,4 +674,19 @@ do
 	'
 done

+# The following test is not necessarily a permanent choice, but since we do not
+# have a "name hash version" bit in the .bitmap file format, we cannot write the
+# full-name hash values into the .bitmap file without risking breakage later.
+#
+# TODO: Make these compatible in the future and replace this test with the
+# expected behavior when both are specified.
+test_expect_success '--full-name-hash and --write-bitmap-index are incompatible' '
+	test_must_fail git pack-objects base --all \
+		--full-name-hash --write-bitmap-index 2>err &&
+	grep incompatible err &&
+
+	# --stdout option silently removes --write-bitmap-index
+	git pack-objects --stdout --all --full-name-hash --write-bitmap-index >out
+'
+
 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 2/6] repack: test --full-name-hash option
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
  2024-09-18 20:46   ` [PATCH v2 1/6] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
@ 2024-09-18 20:46   ` Derrick Stolee via GitGitGadget
  2024-09-19 22:11     ` Junio C Hamano
  2024-09-18 20:46   ` [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-18 20:46 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The new '--full-name-hash' option for 'git repack' is a simple
pass-through to the underlying 'git pack-objects' subcommand. However,
this subcommand may have other options and a temporary filename as part
of the subcommand execution that may not be predictable or could change
over time.

The existing test_subcommand method requires an exact list of arguments
for the subcommand. This is too rigid for our needs here, so create a
new method, test_subcommand_flex. Use it to check that the
--full-name-hash option is passing through.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 t/t7700-repack.sh       |  7 +++++++
 t/test-lib-functions.sh | 27 +++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index be1188e7365..1feb6643e69 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -776,6 +776,13 @@ test_expect_success 'repack -ad cleans up old .tmp-* packs' '
 	test_must_be_empty tmpfiles
 '
 
+test_expect_success '--full-name-hash option passes through to pack-objects' '
+	GIT_TRACE2_EVENT="$(pwd)/full-trace.txt" \
+		git repack -a --full-name-hash &&
+	test_subcommand_flex git pack-objects --full-name-hash <full-trace.txt
+'
+
+
 test_expect_success 'setup for update-server-info' '
 	git init update-server-info &&
 	test_commit -C update-server-info message
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index fde9bf54fc3..0d94730924a 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1885,6 +1885,33 @@ test_subcommand () {
 	fi
 }
 
+
+# Check that the given subcommand was run with the given set of
+# arguments in order (but with possible extra arguments).
+#
+#	test_subcommand_flex [!] <command> <args>... < <trace>
+#
+# If the first parameter passed is !, this instead checks that
+# the given command was not called.
+#
+test_subcommand_flex () {
+	local negate=
+	if test "$1" = "!"
+	then
+		negate=t
+		shift
+	fi
+
+	local expr="$(printf '"%s".*' "$@")"
+
+	if test -n "$negate"
+	then
+		! grep "\[$expr\]"
+	else
+		grep "\[$expr\]"
+	fi
+}
+
 # Check that the given command was invoked as part of the
 # trace2-format trace on stdin.
 #
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
  2024-09-18 20:46   ` [PATCH v2 1/6] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
  2024-09-18 20:46   ` [PATCH v2 2/6] repack: test " Derrick Stolee via GitGitGadget
@ 2024-09-18 20:46   ` Derrick Stolee via GitGitGadget
  2024-09-19 22:22     ` Junio C Hamano
  2024-09-18 20:46   ` [PATCH v2 4/6] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-18 20:46 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Add a new environment variable to opt-in to the --full-name-hash option
in 'git pack-objects'. This allows for extra testing of the feature
without repeating all of the test scenarios.

But this option isn't free. There are a few tests that change behavior
with the variable enabled.

First, there are a few tests that are very sensitive to certain delta
bases being picked. These are both involving the generation of thin
bundles and then counting their objects via 'git index-pack --fix-thin'
which pulls the delta base into the new packfile. For these tests,
disable the option as a decent long-term option.

Second, there are two tests in t5616-partial-clone.sh that I believe are
actually broken scenarios. While the client is set up to clone the
'promisor-server' repo via a treeless partial clone filter (tree:0),
that filter does not translate to the 'server' repo. Thus, fetching from
these repos causes the server to think that the client has all reachable
trees and blobs from the commits advertised as 'haves'. This leads the
server to providing a thin pack assuming those objects as delta bases.
Changing the name-hash algorithm presents new delta bases and thus
breaks the expectations of these tests. An alternative could be to set
up 'server' as a promisor server with the correct filter enabled. This
may also point out more issues with partial clone being set up as a
remote-based filtering mechanism and not a repository-wide setting. For
now, do the minimal change to make the test work by disabling the test
variable.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/pack-objects.c    |  6 ++++--
 ci/run-build-and-tests.sh |  1 +
 t/README                  |  4 ++++
 t/t5510-fetch.sh          |  7 ++++++-
 t/t5616-partial-clone.sh  | 26 ++++++++++++++++++++++++--
 t/t6020-bundle-misc.sh    |  6 +++++-
 6 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4a990d15609..8876fa3d641 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -266,7 +266,7 @@ struct configured_exclusion {
 static struct oidmap configured_exclusions;
 
 static struct oidset excluded_by_config;
-static int use_full_name_hash;
+static int use_full_name_hash = -1;
 
 static inline uint32_t pack_name_hash_fn(const char *name)
 {
@@ -4583,8 +4583,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (pack_to_stdout || !rev_list_all)
 		write_bitmap_index = 0;
 
-	if (write_bitmap_index && use_full_name_hash)
+	if (write_bitmap_index && use_full_name_hash > 0)
 		die(_("currently, the --full-name-hash option is incompatible with --write-bitmap-index"));
+	if (use_full_name_hash < 0)
+		use_full_name_hash = git_env_bool("GIT_TEST_FULL_NAME_HASH", 0);
 
 	if (use_delta_islands)
 		strvec_push(&rp, "--topo-order");
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 2e28d02b20f..75b40f07bbd 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -30,6 +30,7 @@ linux-TEST-vars)
 	export GIT_TEST_NO_WRITE_REV_INDEX=1
 	export GIT_TEST_CHECKOUT_WORKERS=2
 	export GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1
+	export GIT_TEST_FULL_NAME_HASH=1
 	;;
 linux-clang)
 	export GIT_TEST_DEFAULT_HASH=sha1
diff --git a/t/README b/t/README
index 44c02d81298..eaaf9fcd57e 100644
--- a/t/README
+++ b/t/README
@@ -488,6 +488,10 @@ a test and then fails then the whole test run will abort. This can help to make
 sure the expected tests are executed and not silently skipped when their
 dependency breaks or is simply not present in a new environment.
 
+GIT_TEST_FULL_NAME_HASH=<boolean>, when true, sets the default name-hash
+function in 'git pack-objects' to be the one used by the --full-name-hash
+option.
+
 Naming Tests
 ------------
 
diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
index 0890b9f61c5..be79c72495e 100755
--- a/t/t5510-fetch.sh
+++ b/t/t5510-fetch.sh
@@ -1062,7 +1062,12 @@ test_expect_success 'all boundary commits are excluded' '
 	test_tick &&
 	git merge otherside &&
 	ad=$(git log --no-walk --format=%ad HEAD) &&
-	git bundle create twoside-boundary.bdl main --since="$ad" &&
+
+	# If the --full-name-hash function is used here, then no delta
+	# pair is found and the bundle does not expand to three objects
+	# when fixing the thin object.
+	GIT_TEST_FULL_NAME_HASH=0 \
+		git bundle create twoside-boundary.bdl main --since="$ad" &&
 	test_bundle_object_count --thin twoside-boundary.bdl 3
 '
 
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 8415884754b..69d5bab73cd 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -515,7 +515,18 @@ test_expect_success 'fetch lazy-fetches only to resolve deltas' '
 	# Exercise to make sure it works. Git will not fetch anything from the
 	# promisor remote other than for the big tree (because it needs to
 	# resolve the delta).
-	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
+	#
+	# TODO: the --full-name-hash option is disabled here, since this test
+	# is fundamentally broken! When GIT_TEST_FULL_NAME_HASH=1, the server
+	# recognizes delta bases in a different way and then sends a _blob_ to
+	# the client with a delta base that the client does not have! This is
+	# because the client is cloned from "promisor-server" with tree:0 but
+	# is now fetching from "server" withot any filter. This is violating the
+	# promise to the server that all reachable objects exist and could be
+	# used as delta bases!
+	GIT_TRACE_PACKET="$(pwd)/trace" \
+	GIT_TEST_FULL_NAME_HASH=0 \
+		git -C client \
 		fetch "file://$(pwd)/server" main &&
 
 	# Verify the assumption that the client needed to fetch the delta base
@@ -534,7 +545,18 @@ test_expect_success 'fetch lazy-fetches only to resolve deltas, protocol v2' '
 	# Exercise to make sure it works. Git will not fetch anything from the
 	# promisor remote other than for the big blob (because it needs to
 	# resolve the delta).
-	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
+	#
+	# TODO: the --full-name-hash option is disabled here, since this test
+	# is fundamentally broken! When GIT_TEST_FULL_NAME_HASH=1, the server
+	# recognizes delta bases in a different way and then sends a _blob_ to
+	# the client with a delta base that the client does not have! This is
+	# because the client is cloned from "promisor-server" with tree:0 but
+	# is now fetching from "server" withot any filter. This is violating the
+	# promise to the server that all reachable objects exist and could be
+	# used as delta bases!
+	GIT_TRACE_PACKET="$(pwd)/trace" \
+	GIT_TEST_FULL_NAME_HASH=0 \
+		git -C client \
 		fetch "file://$(pwd)/server" main &&
 
 	# Verify that protocol version 2 was used.
diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index 34b5cd62c20..553a20d10e9 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -247,7 +247,11 @@ test_expect_success 'create bundle with --since option' '
 	EOF
 	test_cmp expect actual &&
 
-	git bundle create since.bdl \
+	# If the --full-name-hash option is used, then one fewer
+	# delta base is found and this counts a different number
+	# of objects after performing --fix-thin.
+	GIT_TEST_FULL_NAME_HASH=0 \
+		git bundle create since.bdl \
 		--since "Thu Apr 7 15:27:00 2005 -0700" \
 		--all &&
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 4/6] git-repack: update usage to match docs
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2024-09-18 20:46   ` [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
@ 2024-09-18 20:46   ` Derrick Stolee via GitGitGadget
  2024-09-18 20:46   ` [PATCH v2 5/6] p5313: add size comparison test Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-18 20:46 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

This also adds the '--full-name-hash' option introduced in the previous
change and adds newlines to the synopsis.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-repack.txt | 4 +++-
 builtin/repack.c             | 4 +++-
 t/t0450/txt-help-mismatches  | 1 -
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index c902512a9e8..457a793fa89 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -9,7 +9,9 @@ git-repack - Pack unpacked objects in a repository
 SYNOPSIS
 --------
 [verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
+	[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
+	[--write-midx] [--full-name-hash]
 
 DESCRIPTION
 -----------
diff --git a/builtin/repack.c b/builtin/repack.c
index 87d0cd4d2f2..4fa2c25246d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -38,7 +38,9 @@ static int run_update_server_info = 1;
 static char *packdir, *packtmp_name, *packtmp;
 
 static const char *const git_repack_usage[] = {
-	N_("git repack [<options>]"),
+	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
+	   "[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
+	   "[--write-midx] [--full-name-hash]"),
 	NULL
 };
 
diff --git a/t/t0450/txt-help-mismatches b/t/t0450/txt-help-mismatches
index 28003f18c92..c4a15fd0cb8 100644
--- a/t/t0450/txt-help-mismatches
+++ b/t/t0450/txt-help-mismatches
@@ -45,7 +45,6 @@ rebase
 remote
 remote-ext
 remote-fd
-repack
 reset
 restore
 rev-parse
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 5/6] p5313: add size comparison test
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2024-09-18 20:46   ` [PATCH v2 4/6] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
@ 2024-09-18 20:46   ` Derrick Stolee via GitGitGadget
  2024-09-19 21:58     ` Junio C Hamano
  2024-09-18 20:46   ` [PATCH v2 6/6] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-18 20:46 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

As custom options are added to 'git pack-objects' and 'git repack' to
adjust how compression is done, use this new performance test script to
demonstrate their effectiveness in performance and size.

The recently-added --full-name-hash option swaps the default name-hash
algorithm with one that attempts to uniformly distribute the hashes
based on the full path name instead of the last 16 characters.

This has a dramatic effect on full repacks for repositories with many
versions of most paths. It can have a negative impact on cases such as
pushing a single change.

This can be seen by running pt5313 on the open source fluentui
repository [1]. Most commits will have this kind of output for the thin
and big pack cases, though certain commits (such as [2]) will have
problematic thin pack size for other reasons.

[1] https://github.com/microsoft/fluentui
[2] a637a06df05360ce5ff21420803f64608226a875

Checked out at the parent of [2], I see the following statistics:

Test                                           this tree
------------------------------------------------------------------
5313.2: thin pack                              0.02(0.01+0.01)
5313.3: thin pack size                                    1.1K
5313.4: thin pack with --full-name-hash        0.02(0.01+0.00)
5313.5: thin pack size with --full-name-hash              3.0K
5313.6: big pack                               1.65(3.35+0.24)
5313.7: big pack size                                    58.0M
5313.8: big pack with --full-name-hash         1.53(2.52+0.18)
5313.9: big pack size with --full-name-hash              57.6M
5313.10: repack                                176.52(706.60+3.53)
5313.11: repack size                                    446.7K
5313.12: repack with --full-name-hash          37.47(134.18+3.06)
5313.13: repack size with --full-name-hash              183.1K

Note that this demonstrates a 3x size _increase_ in the case that
simulates a small "git push". The size change is neutral on the case of
pushing the difference between HEAD and HEAD~1000.

However, the full repack case is both faster and more efficient.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 t/perf/p5313-pack-objects.sh | 71 ++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100755 t/perf/p5313-pack-objects.sh

diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
new file mode 100755
index 00000000000..5dcf52acb0d
--- /dev/null
+++ b/t/perf/p5313-pack-objects.sh
@@ -0,0 +1,71 @@
+#!/bin/sh
+
+test_description='Tests pack performance using bitmaps'
+. ./perf-lib.sh
+
+GIT_TEST_PASSING_SANITIZE_LEAK=0
+export GIT_TEST_PASSING_SANITIZE_LEAK
+
+test_perf_large_repo
+
+test_expect_success 'create rev input' '
+	cat >in-thin <<-EOF &&
+	$(git rev-parse HEAD)
+	^$(git rev-parse HEAD~1)
+	EOF
+
+	cat >in-big <<-EOF
+	$(git rev-parse HEAD)
+	^$(git rev-parse HEAD~1000)
+	EOF
+'
+
+test_perf 'thin pack' '
+	git pack-objects --thin --stdout --revs --sparse  <in-thin >out
+'
+
+test_size 'thin pack size' '
+	wc -c <out
+'
+
+test_perf 'thin pack with --full-name-hash' '
+	git pack-objects --thin --stdout --revs --sparse --full-name-hash <in-thin >out
+'
+
+test_size 'thin pack size with --full-name-hash' '
+	wc -c <out
+'
+
+test_perf 'big pack' '
+	git pack-objects --stdout --revs --sparse  <in-big >out
+'
+
+test_size 'big pack size' '
+	wc -c <out
+'
+
+test_perf 'big pack with --full-name-hash' '
+	git pack-objects --stdout --revs --sparse --full-name-hash <in-big >out
+'
+
+test_size 'big pack size with --full-name-hash' '
+	wc -c <out
+'
+
+test_perf 'repack' '
+	git repack -adf
+'
+
+test_size 'repack size' '
+	wc -c <.git/objects/pack/pack-*.pack
+'
+
+test_perf 'repack with --full-name-hash' '
+	git repack -adf --full-name-hash
+'
+
+test_size 'repack size with --full-name-hash' '
+	wc -c <.git/objects/pack/pack-*.pack
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 6/6] test-tool: add helper for name-hash values
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2024-09-18 20:46   ` [PATCH v2 5/6] p5313: add size comparison test Derrick Stolee via GitGitGadget
@ 2024-09-18 20:46   ` Derrick Stolee via GitGitGadget
  2024-09-24  7:02     ` Patrick Steinhardt
  2024-09-18 23:30   ` [PATCH v2 0/6] pack-objects: create new name-hash algorithm Derrick Stolee
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-09-18 20:46 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.

Since the name-hash values can be stored in the .bitmap files, it is
important that these hash functions do not change across Git versions.
Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
the current values. Due to how these functions are implemented, it would
be difficult to change them without disturbing these values.

Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps inform
someone as to the behavior of the name-hash algorithms for their repo
based on the paths at HEAD.

My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                        4.5K
5314.2: number of distinct name-hashes                       4.1K
5314.3: number of distinct full-name-hashes                  4.5K
5314.4: maximum multiplicity of name-hashes                    13
5314.5: maximum multiplicity of fullname-hashes                 1

Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.

In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                       19.6K
5314.2: number of distinct name-hashes                       8.2K
5314.3: number of distinct full-name-hashes                 19.6K
5314.4: maximum multiplicity of name-hashes                   279
5314.5: maximum multiplicity of fullname-hashes                 1

[1] https://github.com/microsoft/fluentui

That demonstrates that of the nearly twenty thousand path names, they
are assigned around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.

In this repository, no collisions occur for the full-name-hash
algorithm.

In a more extreme example, an internal monorepo had a much worse
collision rate:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                      221.6K
5314.2: number of distinct name-hashes                      72.0K
5314.3: number of distinct full-name-hashes                221.6K
5314.4: maximum multiplicity of name-hashes                 14.4K
5314.5: maximum multiplicity of fullname-hashes                 2

Even in this repository with many more paths at HEAD, the collision rate
was low and the maximum number of paths being grouped into a single
bucket by the full-path-name algorithm was two.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Makefile                  |  1 +
 t/helper/test-name-hash.c | 23 ++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 t/perf/p5314-name-hash.sh | 41 +++++++++++++++++++++++++++++++++++++++
 t/t5310-pack-bitmaps.sh   | 26 +++++++++++++++++++++++++
 6 files changed, 93 insertions(+)
 create mode 100644 t/helper/test-name-hash.c
 create mode 100755 t/perf/p5314-name-hash.sh

diff --git a/Makefile b/Makefile
index 275a5ee3c9f..50797a4e541 100644
--- a/Makefile
+++ b/Makefile
@@ -812,6 +812,7 @@ TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o
 TEST_BUILTINS_OBJS += test-match-trees.o
 TEST_BUILTINS_OBJS += test-mergesort.o
 TEST_BUILTINS_OBJS += test-mktemp.o
+TEST_BUILTINS_OBJS += test-name-hash.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
new file mode 100644
index 00000000000..15fb8f853c1
--- /dev/null
+++ b/t/helper/test-name-hash.c
@@ -0,0 +1,23 @@
+/*
+ * test-name-hash.c: Read a list of paths over stdin and report on their
+ * name-hash and full name-hash.
+ */
+
+#include "test-tool.h"
+#include "git-compat-util.h"
+#include "pack-objects.h"
+#include "strbuf.h"
+
+int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
+{
+	struct strbuf line = STRBUF_INIT;
+
+	while (!strbuf_getline(&line, stdin)) {
+		uint32_t name_hash = pack_name_hash(line.buf);
+		uint32_t full_hash = pack_full_name_hash(line.buf);
+
+		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
+	}
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 1ebb69a5dc4..e794058ab6d 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -44,6 +44,7 @@ static struct test_cmd cmds[] = {
 	{ "match-trees", cmd__match_trees },
 	{ "mergesort", cmd__mergesort },
 	{ "mktemp", cmd__mktemp },
+	{ "name-hash", cmd__name_hash },
 	{ "online-cpus", cmd__online_cpus },
 	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..26ff30a5a9a 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -37,6 +37,7 @@ int cmd__lazy_init_name_hash(int argc, const char **argv);
 int cmd__match_trees(int argc, const char **argv);
 int cmd__mergesort(int argc, const char **argv);
 int cmd__mktemp(int argc, const char **argv);
+int cmd__name_hash(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
 int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
diff --git a/t/perf/p5314-name-hash.sh b/t/perf/p5314-name-hash.sh
new file mode 100755
index 00000000000..9fe26612fac
--- /dev/null
+++ b/t/perf/p5314-name-hash.sh
@@ -0,0 +1,41 @@
+#!/bin/sh
+
+test_description='Tests pack performance using bitmaps'
+. ./perf-lib.sh
+
+GIT_TEST_PASSING_SANITIZE_LEAK=0
+export GIT_TEST_PASSING_SANITIZE_LEAK
+
+test_perf_large_repo
+
+test_size 'paths at head' '
+	git ls-tree -r --name-only HEAD >path-list &&
+	wc -l <path-list
+'
+
+test_size 'number of distinct name-hashes' '
+	cat path-list | test-tool name-hash >name-hashes &&
+	cat name-hashes | awk "{ print \$1; }" | sort -n | uniq -c >name-hash-count &&
+	wc -l <name-hash-count
+'
+
+test_size 'number of distinct full-name-hashes' '
+	cat name-hashes | awk "{ print \$2; }" | sort -n | uniq -c >full-name-hash-count &&
+	wc -l <full-name-hash-count
+'
+
+test_size 'maximum multiplicity of name-hashes' '
+	cat name-hash-count | \
+		sort -nr | \
+		head -n 1 | \
+		awk "{ print \$1; }"
+'
+
+test_size 'maximum multiplicity of fullname-hashes' '
+	cat full-name-hash-count | \
+		sort -nr | \
+		head -n 1 | \
+		awk "{ print \$1; }"
+'
+
+test_done
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index a6de7c57643..5759710a2cd 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -26,6 +26,32 @@ has_any () {
 	grep -Ff "$1" "$2"
 }
 
+# Since name-hash values are stored in the .bitmap files, add a test
+# that checks that the name-hash calculations are stable across versions.
+# Not exhaustive, but these hashing algorithms would be hard to change
+# without causing deviations here.
+test_expect_success 'name-hash value stability' '
+	cat >names <<-\EOF &&
+	first
+	second
+	third
+	one-long-enough-for-collisions
+	two-long-enough-for-collisions
+	EOF
+
+	test-tool name-hash <names >out &&
+
+	cat >expect <<-\EOF &&
+	2582249472	3109209818	first
+	2289942528	3781118409	second
+	2300837888	3028707182	third
+	2544516325	3241327563	one-long-enough-for-collisions
+	2544516325	4207880830	two-long-enough-for-collisions
+	EOF
+
+	test_cmp expect out
+'
+
 test_bitmap_cases () {
 	writeLookupTable=false
 	for i in "$@"
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] pack-objects: create new name-hash algorithm
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2024-09-18 20:46   ` [PATCH v2 6/6] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
@ 2024-09-18 23:30   ` Derrick Stolee
  2024-10-04 21:17   ` Junio C Hamano
  2024-10-08 18:32   ` Derrick Stolee
  8 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2024-09-18 23:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren

On 9/18/24 4:46 PM, Derrick Stolee via GitGitGadget wrote:
...
 > Other things that have happened include investigations into ways to adapt
 > the full-name hash to improve upon the name-hash. I did some experimenting
 > with increasing the size of 'struct object_entry' by using a 64-bit hash
 > value (name-hash, then full-name-hash) for a single-pass compression or two
 > 32-bit hash values for a two-pass compression process. I include my WIP
 > branch at [3] to show what I tried, though the single-pass option did not
 > present any improvements and the two-pass option seems to be broken to the
 > point that the compression is substantially worse. (I'll try to elaborate on
 > this in a reply to this cover letter.)
 >
 > [3] 
https://github.com/derrickstolee/git/compare/full-name...derrickstolee:git:full-name-wip

To break down what I attempted in [3], let me break down a few things.

First, I tried using a 64-bit hash value [1]. This used the standard name-hash
as the most-significant digits and the full-name-hash as the least-significant
digits. The goal here was to still have locality from the name-hash but get a
good partition based on full-name-hash within those collisions.

However, when sorting this way, the boundaries of the full-name-hash partitions
are ineffective at getting good delta bases because the largest object from one
full-name-hash set is next to the smallest object from the next full-name-hash
set. Even when a full-name-hash set has size one, it is sorted roughly randomly
among the other colliding path names instead of grouped nicely with objects of
a similar size. This makes the results nearly identical to the 32-bit
full-name-hash implementation.

[1] 
https://github.com/derrickstolee/git/commit/aaa6befa3016667ea5eb10fdd6aa2b7fcec3a52e

Second, I tried storing two 32-bit hashes and doing a two-pass delta search [2].
In theory, this should be very similar to the --path-walk feature from the RFC.
However, I failed to make it work. Something about this version of a two-pass
walk was hitting some strange behavior. For example, I had to put in this extra
condition [4] if a best delta base was not found, or else we could get a
segfault.

[2] 
https://github.com/derrickstolee/git/commit/bf71271040ab93a624a8cdf5bc8aaff68e9b1b17

[4] 
https://github.com/derrickstolee/git/commit/fedc4fc543e50563f4748a5ffc45b51b530023e0

In fact, the results were not just _bad_ but they were _significantly worse_.

It took me a long time to report these details because they just didn't make
sense and I couldn't figure out what was going wrong. I'd be very grateful to
anyone who could explore these WIP commits and point out what I'm doing wrong
so I can learn and maybe we can get a boost to the feature.

Even if we had strong data from these examples, I'm not sure that we'd want
to add four bytes per object to the packing data, especially in a way that
impacts users that aren't even using the new feature. We would want to
explore options that use some kind of hashtable to map objects to their
64-bit hash values, perhaps. It also affects the .bitmap file format, which
would need modification even for a new 32-bit hash function (though one of
the same size could be used by adding an extension saying "I'm using hash
function v2" and leave the rest of the structure the same).

I would also like to test the performance against the threaded version of the
--path-walk feature, which I recently got working in my prototype branch [5].

[5] 
https://github.com/derrickstolee/git/pull/28/commits/a9fc233390ae00e3d4b156be64d6b3974e30d8a1

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/6] pack-objects: add --full-name-hash option
  2024-09-18 20:46   ` [PATCH v2 1/6] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
@ 2024-09-19 21:56     ` Junio C Hamano
  0 siblings, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2024-09-19 21:56 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +static inline uint32_t pack_full_name_hash(const char *name)
> +{
> +	const uint32_t bigp = 1234572167U;
> +	uint32_t c, hash = bigp;
> +
> +	if (!name)
> +		return 0;
> +
> +	/*
> +	 * Do the simplest thing that will resemble pseduo-randomness: add

"pseduo" -> "pseudo"

> +	 * random multiples of a large prime number with a binary shift.
> +	 * The goal is not to be cryptographic, but to be generally
> +	 * uniformly distributed.
> +	 */

Other than that, there is no substantial difference since the
previous iteration; this step looks good.

Thanks.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 5/6] p5313: add size comparison test
  2024-09-18 20:46   ` [PATCH v2 5/6] p5313: add size comparison test Derrick Stolee via GitGitGadget
@ 2024-09-19 21:58     ` Junio C Hamano
  2024-09-23  1:50       ` Derrick Stolee
  0 siblings, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2024-09-19 21:58 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> As custom options are added to 'git pack-objects' and 'git repack' to
> adjust how compression is done, use this new performance test script to
> demonstrate their effectiveness in performance and size.
>
> The recently-added --full-name-hash option swaps the default name-hash
> algorithm with one that attempts to uniformly distribute the hashes
> based on the full path name instead of the last 16 characters.
>
> This has a dramatic effect on full repacks for repositories with many
> versions of most paths. It can have a negative impact on cases such as
> pushing a single change.
>
> This can be seen by running pt5313 on the open source fluentui
> repository [1]. Most commits will have this kind of output for the thin
> and big pack cases, though certain commits (such as [2]) will have
> problematic thin pack size for other reasons.
>
> [1] https://github.com/microsoft/fluentui
> [2] a637a06df05360ce5ff21420803f64608226a875
>
> Checked out at the parent of [2], I see the following statistics:
>
> Test                                           this tree
> ------------------------------------------------------------------
> 5313.2: thin pack                              0.02(0.01+0.01)
> 5313.3: thin pack size                                    1.1K
> 5313.4: thin pack with --full-name-hash        0.02(0.01+0.00)
> 5313.5: thin pack size with --full-name-hash              3.0K
> 5313.6: big pack                               1.65(3.35+0.24)
> 5313.7: big pack size                                    58.0M
> 5313.8: big pack with --full-name-hash         1.53(2.52+0.18)
> 5313.9: big pack size with --full-name-hash              57.6M
> 5313.10: repack                                176.52(706.60+3.53)
> 5313.11: repack size                                    446.7K
> 5313.12: repack with --full-name-hash          37.47(134.18+3.06)
> 5313.13: repack size with --full-name-hash              183.1K
>
> Note that this demonstrates a 3x size _increase_ in the case that
> simulates a small "git push". The size change is neutral on the case of
> pushing the difference between HEAD and HEAD~1000.
>
> However, the full repack case is both faster and more efficient.

Nice.

> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  t/perf/p5313-pack-objects.sh | 71 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 71 insertions(+)
>  create mode 100755 t/perf/p5313-pack-objects.sh

"wc -c" -> "test_file_size" or "test-tool path-utils file-size"?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 2/6] repack: test --full-name-hash option
  2024-09-18 20:46   ` [PATCH v2 2/6] repack: test " Derrick Stolee via GitGitGadget
@ 2024-09-19 22:11     ` Junio C Hamano
  0 siblings, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2024-09-19 22:11 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> The existing test_subcommand method requires an exact list of arguments
> for the subcommand. This is too rigid for our needs here, so create a
> new method, test_subcommand_flex. Use it to check that the
> --full-name-hash option is passing through.

This is something I found need for in the past at least a few times.

> +# Check that the given subcommand was run with the given set of
> +# arguments in order (but with possible extra arguments).
> +#
> +#	test_subcommand_flex [!] <command> <args>... < <trace>
> +#
> +# If the first parameter passed is !, this instead checks that
> +# the given command was not called.
> +#
> +test_subcommand_flex () {
> +	local negate=
> +	if test "$1" = "!"
> +	then
> +		negate=t
> +		shift
> +	fi
> +
> +	local expr="$(printf '"%s".*' "$@")"

OK, so it works exactly like the comment said.  You allow arbigrary
garbage in between the given parameters that come in the order
given.  As long as this is used to make sure that the flags are
passed through, by somebody who knows how the code constructs the
command line, this should be fine, as we won't be permuting the
command line parameters.

Looking good.


> +	if test -n "$negate"
> +	then
> +		! grep "\[$expr\]"
> +	else
> +		grep "\[$expr\]"
> +	fi
> +}
> +
>  # Check that the given command was invoked as part of the
>  # trace2-format trace on stdin.
>  #

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH
  2024-09-18 20:46   ` [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
@ 2024-09-19 22:22     ` Junio C Hamano
  2024-09-23  1:39       ` Derrick Stolee
  0 siblings, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2024-09-19 22:22 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> Add a new environment variable to opt-in to the --full-name-hash option
> in 'git pack-objects'. This allows for extra testing of the feature
> without repeating all of the test scenarios.

This also allows the programmer on the C implementation side to be a
bit lazy, as the --full-name-hash option does not have to be plumbed
through from the end-user facing commands (like "bundle") down to
the underlying "pack-objects" command ;-).

As an end-user facing tweak mechanism, an environment variable is
the most clunky, followed by a configuration variable (which can be
used via "git -c" and exhibits the same clunkiness as an environment
variable), and a command line parameter is the most versatile in
allowing users to customize the behaviour per-invocation of the
commands.  So in the longer term, we probably want to plumb through
the option, like you did for "repack -> pack-objects" call chain,
for all end-user visible commands that call into pack-objects.

But for testing purposes, the solution presented here is of course
good enough.

> Second, there are two tests in t5616-partial-clone.sh that I believe are
> actually broken scenarios. While the client is set up to clone the
> 'promisor-server' repo via a treeless partial clone filter (tree:0),
> that filter does not translate to the 'server' repo. Thus, fetching from
> these repos causes the server to think that the client has all reachable
> trees and blobs from the commits advertised as 'haves'. This leads the
> server to providing a thin pack assuming those objects as delta bases.

In short, the tests are based on broken assumption and checking
bogus outcome?  Somebody familiar with the partial clone area should
probably take a look into it and fix the tests if that is the case.

> -	if (write_bitmap_index && use_full_name_hash)
> +	if (write_bitmap_index && use_full_name_hash > 0)
>  		die(_("currently, the --full-name-hash option is incompatible with --write-bitmap-index"));
> +	if (use_full_name_hash < 0)
> +		use_full_name_hash = git_env_bool("GIT_TEST_FULL_NAME_HASH", 0);

OK.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH
  2024-09-19 22:22     ` Junio C Hamano
@ 2024-09-23  1:39       ` Derrick Stolee
  0 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2024-09-23  1:39 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren

On 9/19/24 6:22 PM, Junio C Hamano wrote:
 > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

 >> Second, there are two tests in t5616-partial-clone.sh that I believe are
 >> actually broken scenarios. While the client is set up to clone the
 >> 'promisor-server' repo via a treeless partial clone filter (tree:0),
 >> that filter does not translate to the 'server' repo. Thus, fetching from
 >> these repos causes the server to think that the client has all reachable
 >> trees and blobs from the commits advertised as 'haves'. This leads the
 >> server to providing a thin pack assuming those objects as delta bases.
 >
 > In short, the tests are based on broken assumption and checking
 > bogus outcome?  Somebody familiar with the partial clone area should
 > probably take a look into it and fix the tests if that is the case.

That is my understanding, yes. It is a common issue that I've had when
using partial clones with multiple remotes and forgetting to modify the
filter for each, so is also a usability wart that could use mending
(say, by adding the existing filter when adding a new remote).

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 5/6] p5313: add size comparison test
  2024-09-19 21:58     ` Junio C Hamano
@ 2024-09-23  1:50       ` Derrick Stolee
  2024-09-23 16:14         ` Junio C Hamano
  0 siblings, 1 reply; 38+ messages in thread
From: Derrick Stolee @ 2024-09-23  1:50 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren

On 9/19/24 5:58 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>>   t/perf/p5313-pack-objects.sh | 71 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 71 insertions(+)
>>   create mode 100755 t/perf/p5313-pack-objects.sh
> 
> "wc -c" -> "test_file_size" or "test-tool path-utils file-size"?

Thanks. I was unfamiliar with those options.

I will squash this change into the next version. The use of a
variable for the packfile name was important at some point while
I was testing this with various repos on different platforms.

--- >8 ---

diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
index 5dcf52acb0..bf6f0d69e4 100755
--- a/t/perf/p5313-pack-objects.sh
+++ b/t/perf/p5313-pack-objects.sh
@@ -25,7 +25,7 @@ test_perf 'thin pack' '
  '

  test_size 'thin pack size' '
-	wc -c <out
+	test_file_size out
  '

  test_perf 'thin pack with --full-name-hash' '
@@ -33,7 +33,7 @@ test_perf 'thin pack with --full-name-hash' '
  '

  test_size 'thin pack size with --full-name-hash' '
-	wc -c <out
+	test_file_size out
  '

  test_perf 'big pack' '
@@ -41,7 +41,7 @@ test_perf 'big pack' '
  '

  test_size 'big pack size' '
-	wc -c <out
+	test_file_size out
  '

  test_perf 'big pack with --full-name-hash' '
@@ -49,7 +49,7 @@ test_perf 'big pack with --full-name-hash' '
  '

  test_size 'big pack size with --full-name-hash' '
-	wc -c <out
+	test_file_size out
  '

  test_perf 'repack' '
@@ -57,7 +57,8 @@ test_perf 'repack' '
  '

  test_size 'repack size' '
-	wc -c <.git/objects/pack/pack-*.pack
+	pack=$(ls .git/objects/pack/pack-*.pack) &&
+	test_file_size "$pack"
  '

  test_perf 'repack with --full-name-hash' '
@@ -65,7 +66,8 @@ test_perf 'repack with --full-name-hash' '
  '

  test_size 'repack size with --full-name-hash' '
-	wc -c <.git/objects/pack/pack-*.pack
+	pack=$(ls .git/objects/pack/pack-*.pack) &&
+	test_file_size "$pack"
  '

  test_done


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 5/6] p5313: add size comparison test
  2024-09-23  1:50       ` Derrick Stolee
@ 2024-09-23 16:14         ` Junio C Hamano
  0 siblings, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2024-09-23 16:14 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, peff,
	ps, me, johncai86, newren

Derrick Stolee <stolee@gmail.com> writes:

> On 9/19/24 5:58 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>>>   t/perf/p5313-pack-objects.sh | 71 ++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 71 insertions(+)
>>>   create mode 100755 t/perf/p5313-pack-objects.sh
>> "wc -c" -> "test_file_size" or "test-tool path-utils file-size"?
>
> Thanks. I was unfamiliar with those options.

I was, too ;-).  There are a few other test pieces that use "wc -c"
to count bytes, but it made me a bit nervous to use it to count
bytes in a binary file.

>  test_size 'repack size' '
> -	wc -c <.git/objects/pack/pack-*.pack
> +	pack=$(ls .git/objects/pack/pack-*.pack) &&
> +	test_file_size "$pack"
>  '

I am perfectly fine (and actually even prefer) to pay the cost to
fork "ls" here, but if you really wanted to avoid it, we could do
something like

	pack=$(
		set x .git/objects/pack/pack-*.pack &&
		test $# = 2 && echo "$2" || exit 1
	) &&
	test_file_size "$pack"

;-)

>  test_perf 'repack with --full-name-hash' '
> @@ -65,7 +66,8 @@ test_perf 'repack with --full-name-hash' '
>  '
>
>  test_size 'repack size with --full-name-hash' '
> -	wc -c <.git/objects/pack/pack-*.pack
> +	pack=$(ls .git/objects/pack/pack-*.pack) &&
> +	test_file_size "$pack"
>  '
>
>  test_done

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 6/6] test-tool: add helper for name-hash values
  2024-09-18 20:46   ` [PATCH v2 6/6] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
@ 2024-09-24  7:02     ` Patrick Steinhardt
  2024-09-25 13:35       ` Derrick Stolee
  0 siblings, 1 reply; 38+ messages in thread
From: Patrick Steinhardt @ 2024-09-24  7:02 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	Derrick Stolee

On Wed, Sep 18, 2024 at 08:46:21PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
> diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
> new file mode 100644
> index 00000000000..15fb8f853c1
> --- /dev/null
> +++ b/t/helper/test-name-hash.c
> @@ -0,0 +1,23 @@
> +/*
> + * test-name-hash.c: Read a list of paths over stdin and report on their
> + * name-hash and full name-hash.
> + */
> +
> +#include "test-tool.h"
> +#include "git-compat-util.h"
> +#include "pack-objects.h"
> +#include "strbuf.h"
> +
> +int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
> +{
> +	struct strbuf line = STRBUF_INIT;
> +
> +	while (!strbuf_getline(&line, stdin)) {
> +		uint32_t name_hash = pack_name_hash(line.buf);
> +		uint32_t full_hash = pack_full_name_hash(line.buf);
> +
> +		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
> +	}
> +
> +	return 0;
> +}

This patch breaks t5310 with the leak sanitizer enabled due to the
leaking `struct strbuf line`. It needs the following diff on top:

diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
index 15fb8f853c..e4ecd159b7 100644
--- a/t/helper/test-name-hash.c
+++ b/t/helper/test-name-hash.c
@@ -19,5 +19,6 @@ int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
 		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
 	}
 
+	strbuf_release(&line);
 	return 0;
 }

I also plan to eventually have a deeper look at your patch series, but
didn't yet find the time to do so until now :(

Patrick

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 6/6] test-tool: add helper for name-hash values
  2024-09-24  7:02     ` Patrick Steinhardt
@ 2024-09-25 13:35       ` Derrick Stolee
  0 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2024-09-25 13:35 UTC (permalink / raw)
  To: Patrick Steinhardt, Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren

On 9/24/24 3:02 AM, Patrick Steinhardt wrote:

> This patch breaks t5310 with the leak sanitizer enabled due to the
> leaking `struct strbuf line`. It needs the following diff on top:
> 
> diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
> index 15fb8f853c..e4ecd159b7 100644
> --- a/t/helper/test-name-hash.c
> +++ b/t/helper/test-name-hash.c
> @@ -19,5 +19,6 @@ int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
>   		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
>   	}
>   
> +	strbuf_release(&line);
>   	return 0;
>   }

Thanks! I'll make sure this gets in the next version.

> I also plan to eventually have a deeper look at your patch series, but
> didn't yet find the time to do so until now :(
Thanks for taking the time, when you have it.

-Stolee


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] pack-objects: create new name-hash algorithm
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2024-09-18 23:30   ` [PATCH v2 0/6] pack-objects: create new name-hash algorithm Derrick Stolee
@ 2024-10-04 21:17   ` Junio C Hamano
  2024-10-04 22:30     ` Taylor Blau
  2024-10-04 22:35     ` Jeff King
  2024-10-08 18:32   ` Derrick Stolee
  8 siblings, 2 replies; 38+ messages in thread
From: Junio C Hamano @ 2024-10-04 21:17 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> I've been focused recently on understanding and mitigating the growth of a
> few internal repositories. Some of these are growing much larger than
> expected for the number of contributors, and there are multiple aspects to
> why this growth is so large.
> ...
> Derrick Stolee (6):
>   pack-objects: add --full-name-hash option
>   repack: test --full-name-hash option
>   pack-objects: add GIT_TEST_FULL_NAME_HASH
>   git-repack: update usage to match docs
>   p5313: add size comparison test
>   test-tool: add helper for name-hash values

Recent CI jobs on 'seen' has been failing the leak-check jobs, e.g.

  https://github.com/git/git/actions/runs/11184876759/job/31096601111#step:4:1886

shows that t5310 and t5334 are having problems.

I randomly picked this topic and ejected it out of 'seen', and the
result is fairing a bit better. t5310 that failed 228/228 subtests
in the above run is now passing.  I didn't run this topic alone
under the leak checker, so it is entirely possible that there are
some unfortunate interactions with other topics in flight.

t5334 still fails 8/16 subtests just like the above run, exonerating
this topic for its breakage.

  https://github.com/git/git/actions/runs/11186737635/job/31102378478#step:4:1880


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] pack-objects: create new name-hash algorithm
  2024-10-04 21:17   ` Junio C Hamano
@ 2024-10-04 22:30     ` Taylor Blau
  2024-10-04 22:35     ` Jeff King
  1 sibling, 0 replies; 38+ messages in thread
From: Taylor Blau @ 2024-10-04 22:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, peff,
	ps, johncai86, newren, Derrick Stolee

On Fri, Oct 04, 2024 at 02:17:30PM -0700, Junio C Hamano wrote:
> t5334 still fails 8/16 subtests just like the above run, exonerating
> this topic for its breakage.
>
>   https://github.com/git/git/actions/runs/11186737635/job/31102378478#step:4:1880

This is not all too surprising, since part two of the incremental MIDX
topic introduces some new leaks which were not seen by Patrick's recent
leakfixes. So I expect that this broke with 9d4855eef3 (midx-write: fix
leaking buffer, 2024-09-30), which marked t5534 as leak-free.

And that's true, without the patches from tb/incremental-midx-part-2.
The leak stems from the fact that the 'bitmap_index' struct now has a
new optional "base" pointer, which points to another 'bitmap_index'
corresponding to the previous MIDX layer.

The fix is fairly straightforward, and should be limited to:

--- 8< ---
Subject: [PATCH] fixup! pack-bitmap.c: open and store incremental bitmap
 layers

The incremental MIDX bitmap work was done prior to 9d4855eef3
(midx-write: fix leaking buffer, 2024-09-30), and causes test failures
in t5334 in a post-9d4855eef3 world.

The leak looks like:

    Direct leak of 264 byte(s) in 1 object(s) allocated from:
        #0 0x7f6bcd87eaca in calloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:90
        #1 0x55ad1428e8a4 in xcalloc wrapper.c:151
        #2 0x55ad14199e16 in prepare_midx_bitmap_git pack-bitmap.c:742
        #3 0x55ad14199447 in open_midx_bitmap_1 pack-bitmap.c:507
        #4 0x55ad14199cca in open_midx_bitmap pack-bitmap.c:704
        #5 0x55ad14199d44 in open_bitmap pack-bitmap.c:717
        #6 0x55ad14199dc2 in prepare_bitmap_git pack-bitmap.c:733
        #7 0x55ad1419e496 in test_bitmap_walk pack-bitmap.c:2698
        #8 0x55ad14047b0b in cmd_rev_list builtin/rev-list.c:629
        #9 0x55ad13f71cd6 in run_builtin git.c:487
        #10 0x55ad13f72132 in handle_builtin git.c:756
        #11 0x55ad13f72380 in run_argv git.c:826
        #12 0x55ad13f728f4 in cmd_main git.c:961
        #13 0x55ad1407d3ae in main common-main.c:64
        #14 0x7f6bcd5f0c89 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
        #15 0x7f6bcd5f0d44 in __libc_start_main_impl ../csu/libc-start.c:360
        #16 0x55ad13f6ff90 in _start (git+0x1ef90) (BuildId: 3e63cdd415f1d185b21da3035cb48332510dddce)

, and is a result of us not freeing the resources corresponding to the
bitmap's base layer, if one was present.

Rectify that leak by calling the newly-introduced free_bitmap_index()
function on the base layer to ensure that its resources are also freed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 2b4c3972e5..fe0df2b6c3 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -3045,6 +3045,7 @@ void free_bitmap_index(struct bitmap_index *b)
 		close_midx_revindex(b->midx);
 	}
 	free_pseudo_merge_map(&b->pseudo_merges);
+	free_bitmap_index(b->base);
 	free(b);
 }

--
2.47.0.rc1.154.ge6e38f19f35.dirty
--- >8 ---

That should apply on top of 'seen' easily, and will at least fix the
leak failures you pointed out here.

When the integration branches are rewound after 2.47.0 is tagged, I'll
send a new version of this topic based on Patrick's work here.

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] pack-objects: create new name-hash algorithm
  2024-10-04 21:17   ` Junio C Hamano
  2024-10-04 22:30     ` Taylor Blau
@ 2024-10-04 22:35     ` Jeff King
  1 sibling, 0 replies; 38+ messages in thread
From: Jeff King @ 2024-10-04 22:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, ps, me,
	johncai86, newren, Derrick Stolee

On Fri, Oct 04, 2024 at 02:17:30PM -0700, Junio C Hamano wrote:

> > Derrick Stolee (6):
> >   pack-objects: add --full-name-hash option
> >   repack: test --full-name-hash option
> >   pack-objects: add GIT_TEST_FULL_NAME_HASH
> >   git-repack: update usage to match docs
> >   p5313: add size comparison test
> >   test-tool: add helper for name-hash values
> 
> Recent CI jobs on 'seen' has been failing the leak-check jobs, e.g.
> 
>   https://github.com/git/git/actions/runs/11184876759/job/31096601111#step:4:1886
> 
> shows that t5310 and t5334 are having problems.
> 
> I randomly picked this topic and ejected it out of 'seen', and the
> result is fairing a bit better. t5310 that failed 228/228 subtests
> in the above run is now passing.  I didn't run this topic alone
> under the leak checker, so it is entirely possible that there are
> some unfortunate interactions with other topics in flight.

Maybe squash this into the final patch of that series?

diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
index 15fb8f853c..e4ecd159b7 100644
--- a/t/helper/test-name-hash.c
+++ b/t/helper/test-name-hash.c
@@ -19,5 +19,6 @@ int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
 		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
 	}
 
+	strbuf_release(&line);
 	return 0;
 }

That seems to be enough to clear t5310 on "seen". It was not noticed in
the original topic because t5310 was not yet marked as leak-free in its
base. That happened in fa016423c7 (revision: fix leaking parents when
simplifying commits, 2024-09-26)

-Peff

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] pack-objects: create new name-hash algorithm
  2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
                     ` (7 preceding siblings ...)
  2024-10-04 21:17   ` Junio C Hamano
@ 2024-10-08 18:32   ` Derrick Stolee
  8 siblings, 0 replies; 38+ messages in thread
From: Derrick Stolee @ 2024-10-08 18:32 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren

On 9/18/24 4:46 PM, Derrick Stolee via GitGitGadget wrote:
> I've been focused recently on understanding and mitigating the growth of a
> few internal repositories. Some of these are growing much larger than
> expected for the number of contributors, and there are multiple aspects to
> why this growth is so large.
> 
> This is part of the RFC I submitted [1] [2] involving the path-walk API,
> though this doesn't use the path-walk API directly. In full repack cases, it
> seems that the --full-name-hash option gets nearly as good compression as
> the --path-walk option introduced in that series. I continue to work on that
> feature as well, so we can review it after this series is complete.
Based on the discussion in this thread, I recommend pulling this branch out
of 'seen' and instead focus on the --path-walk option which is now available
at [3]. If we choose to revisit the --full-name-hash option, then that can
be done on top of that feature which is probably a higher priority.

[3] https://lore.kernel.org/git/pull.1813.git.1728396723.gitgitgadget@gmail.com

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2024-10-08 18:32 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` [PATCH 1/4] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-09 17:45   ` Junio C Hamano
2024-09-10  2:31     ` Derrick Stolee
2024-09-09 13:56 ` [PATCH 2/4] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-09 17:48   ` Junio C Hamano
2024-09-09 13:56 ` [PATCH 3/4] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` [PATCH 4/4] p5314: add a size test for name-hash collisions Derrick Stolee via GitGitGadget
2024-09-09 16:07 ` [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee
2024-09-09 18:06 ` Junio C Hamano
2024-09-10  2:37   ` Derrick Stolee
2024-09-10 14:56     ` Taylor Blau
2024-09-10 21:05       ` Derrick Stolee
2024-09-11  6:32         ` Jeff King
2024-09-10 16:07     ` Junio C Hamano
2024-09-10 20:36       ` Junio C Hamano
2024-09-10 21:09         ` Derrick Stolee
2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
2024-09-18 20:46   ` [PATCH v2 1/6] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-19 21:56     ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 2/6] repack: test " Derrick Stolee via GitGitGadget
2024-09-19 22:11     ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
2024-09-19 22:22     ` Junio C Hamano
2024-09-23  1:39       ` Derrick Stolee
2024-09-18 20:46   ` [PATCH v2 4/6] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-18 20:46   ` [PATCH v2 5/6] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-19 21:58     ` Junio C Hamano
2024-09-23  1:50       ` Derrick Stolee
2024-09-23 16:14         ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 6/6] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-09-24  7:02     ` Patrick Steinhardt
2024-09-25 13:35       ` Derrick Stolee
2024-09-18 23:30   ` [PATCH v2 0/6] pack-objects: create new name-hash algorithm Derrick Stolee
2024-10-04 21:17   ` Junio C Hamano
2024-10-04 22:30     ` Taylor Blau
2024-10-04 22:35     ` Jeff King
2024-10-08 18:32   ` Derrick Stolee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).