git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net,
	ps@pks.im, me@ttaylorr.com, johncai86@gmail.com,
	newren@gmail.com, Derrick Stolee <stolee@gmail.com>,
	Derrick Stolee <stolee@gmail.com>
Subject: [PATCH 4/4] p5314: add a size test for name-hash collisions
Date: Mon, 09 Sep 2024 13:56:50 +0000	[thread overview]
Message-ID: <5bcbcce66650e0a4addabcc86fe834a6c9b1761c.1725890211.git.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.1785.git.1725890210.gitgitgadget@gmail.com>

From: Derrick Stolee <stolee@gmail.com>

Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.

Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps inform
someone as to the behavior of the name-hash algorithms for their repo
based on the paths at HEAD.

My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                        4.5K
5314.2: number of distinct name-hashes                       4.1K
5314.3: number of distinct full-name-hashes                  4.5K
5314.4: maximum multiplicity of name-hashes                    13
5314.5: maximum multiplicity of fullname-hashes                 1

Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.

In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                       19.6K
5314.2: number of distinct name-hashes                       8.2K
5314.3: number of distinct full-name-hashes                 19.6K
5314.4: maximum multiplicity of name-hashes                   279
5314.5: maximum multiplicity of fullname-hashes                 1

[1] https://github.com/microsoft/fluentui

That demonstrates that of the nearly twenty thousand path names, they
are assigned around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.

In this repository, no collisions occur for the full-name-hash
algorithm.

In a more extreme example, an internal monorepo had a much worse
collision rate:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                      221.6K
5314.2: number of distinct name-hashes                      72.0K
5314.3: number of distinct full-name-hashes                221.6K
5314.4: maximum multiplicity of name-hashes                 14.4K
5314.5: maximum multiplicity of fullname-hashes                 2

Even in this repository with many more paths at HEAD, the collision rate
was low and the maximum number of paths being grouped into a single
bucket by the full-path-name algorithm was two.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Makefile                  |  1 +
 t/helper/test-name-hash.c | 23 ++++++++++++++++++++++
 t/helper/test-tool.c      |  1 +
 t/helper/test-tool.h      |  1 +
 t/perf/p5314-name-hash.sh | 41 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 67 insertions(+)
 create mode 100644 t/helper/test-name-hash.c
 create mode 100755 t/perf/p5314-name-hash.sh

diff --git a/Makefile b/Makefile
index 91f65d7dc57..a5ce284cf4a 100644
--- a/Makefile
+++ b/Makefile
@@ -812,6 +812,7 @@ TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o
 TEST_BUILTINS_OBJS += test-match-trees.o
 TEST_BUILTINS_OBJS += test-mergesort.o
 TEST_BUILTINS_OBJS += test-mktemp.o
+TEST_BUILTINS_OBJS += test-name-hash.o
 TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-pack-mtimes.o
diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
new file mode 100644
index 00000000000..15fb8f853c1
--- /dev/null
+++ b/t/helper/test-name-hash.c
@@ -0,0 +1,23 @@
+/*
+ * test-name-hash.c: Read a list of paths over stdin and report on their
+ * name-hash and full name-hash.
+ */
+
+#include "test-tool.h"
+#include "git-compat-util.h"
+#include "pack-objects.h"
+#include "strbuf.h"
+
+int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
+{
+	struct strbuf line = STRBUF_INIT;
+
+	while (!strbuf_getline(&line, stdin)) {
+		uint32_t name_hash = pack_name_hash(line.buf);
+		uint32_t full_hash = pack_full_name_hash(line.buf);
+
+		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
+	}
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index f8a67df7de9..4a603921002 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -43,6 +43,7 @@ static struct test_cmd cmds[] = {
 	{ "match-trees", cmd__match_trees },
 	{ "mergesort", cmd__mergesort },
 	{ "mktemp", cmd__mktemp },
+	{ "name-hash", cmd__name_hash },
 	{ "oid-array", cmd__oid_array },
 	{ "online-cpus", cmd__online_cpus },
 	{ "pack-mtimes", cmd__pack_mtimes },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index e74bc0ffd41..56a83bf3aac 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -37,6 +37,7 @@ int cmd__lazy_init_name_hash(int argc, const char **argv);
 int cmd__match_trees(int argc, const char **argv);
 int cmd__mergesort(int argc, const char **argv);
 int cmd__mktemp(int argc, const char **argv);
+int cmd__name_hash(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
 int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
diff --git a/t/perf/p5314-name-hash.sh b/t/perf/p5314-name-hash.sh
new file mode 100755
index 00000000000..9fe26612fac
--- /dev/null
+++ b/t/perf/p5314-name-hash.sh
@@ -0,0 +1,41 @@
+#!/bin/sh
+
+test_description='Tests pack performance using bitmaps'
+. ./perf-lib.sh
+
+GIT_TEST_PASSING_SANITIZE_LEAK=0
+export GIT_TEST_PASSING_SANITIZE_LEAK
+
+test_perf_large_repo
+
+test_size 'paths at head' '
+	git ls-tree -r --name-only HEAD >path-list &&
+	wc -l <path-list
+'
+
+test_size 'number of distinct name-hashes' '
+	cat path-list | test-tool name-hash >name-hashes &&
+	cat name-hashes | awk "{ print \$1; }" | sort -n | uniq -c >name-hash-count &&
+	wc -l <name-hash-count
+'
+
+test_size 'number of distinct full-name-hashes' '
+	cat name-hashes | awk "{ print \$2; }" | sort -n | uniq -c >full-name-hash-count &&
+	wc -l <full-name-hash-count
+'
+
+test_size 'maximum multiplicity of name-hashes' '
+	cat name-hash-count | \
+		sort -nr | \
+		head -n 1 | \
+		awk "{ print \$1; }"
+'
+
+test_size 'maximum multiplicity of fullname-hashes' '
+	cat full-name-hash-count | \
+		sort -nr | \
+		head -n 1 | \
+		awk "{ print \$1; }"
+'
+
+test_done
-- 
gitgitgadget

  parent reply	other threads:[~2024-09-09 13:56 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` [PATCH 1/4] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-09 17:45   ` Junio C Hamano
2024-09-10  2:31     ` Derrick Stolee
2024-09-09 13:56 ` [PATCH 2/4] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-09 17:48   ` Junio C Hamano
2024-09-09 13:56 ` [PATCH 3/4] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` Derrick Stolee via GitGitGadget [this message]
2024-09-09 16:07 ` [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee
2024-09-09 18:06 ` Junio C Hamano
2024-09-10  2:37   ` Derrick Stolee
2024-09-10 14:56     ` Taylor Blau
2024-09-10 21:05       ` Derrick Stolee
2024-09-11  6:32         ` Jeff King
2024-09-10 16:07     ` Junio C Hamano
2024-09-10 20:36       ` Junio C Hamano
2024-09-10 21:09         ` Derrick Stolee
2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
2024-09-18 20:46   ` [PATCH v2 1/6] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-19 21:56     ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 2/6] repack: test " Derrick Stolee via GitGitGadget
2024-09-19 22:11     ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
2024-09-19 22:22     ` Junio C Hamano
2024-09-23  1:39       ` Derrick Stolee
2024-09-18 20:46   ` [PATCH v2 4/6] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-18 20:46   ` [PATCH v2 5/6] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-19 21:58     ` Junio C Hamano
2024-09-23  1:50       ` Derrick Stolee
2024-09-23 16:14         ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 6/6] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-09-24  7:02     ` Patrick Steinhardt
2024-09-25 13:35       ` Derrick Stolee
2024-09-18 23:30   ` [PATCH v2 0/6] pack-objects: create new name-hash algorithm Derrick Stolee
2024-10-04 21:17   ` Junio C Hamano
2024-10-04 22:30     ` Taylor Blau
2024-10-04 22:35     ` Jeff King
2024-10-08 18:32   ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5bcbcce66650e0a4addabcc86fe834a6c9b1761c.1725890211.git.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=johncai86@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).