From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net,
ps@pks.im, me@ttaylorr.com, johncai86@gmail.com,
newren@gmail.com, Derrick Stolee <stolee@gmail.com>
Subject: [PATCH 0/4] pack-objects: create new name-hash algorithm
Date: Mon, 09 Sep 2024 13:56:46 +0000 [thread overview]
Message-ID: <pull.1785.git.1725890210.gitgitgadget@gmail.com> (raw)
I've been focused recently on understanding and mitigating the growth of a
few internal repositories. Some of these are growing much larger than
expected for the number of contributors, and there are multiple aspects to
why this growth is so large.
I will be submitting an RFC with my deep dive into the many aspects of this
issue, but the very last thing I discovered in this space is actually the
easiest change to make.
The main issue plaguing these repositories is that deltas are not being
computed against objects that appear at the same path. While the size of
these files at tip is one aspect of growth that would prevent this issue,
the changes to these files are reasonable and should result in good delta
compression. However, Git is not discovering the connections across
different versions of the same file.
One way to find some improvement in these repositories is to increase the
window size, which was an initial indicator that the delta compression could
be improved, but was not a clear indicator. After some digging (and
prototyping some analysis tools) the main discovery was that the current
name-hash algorithm only considers the last 16 characters in the path name
and has some naturally-occurring collisions within that scope.
This series introduces a new name-hash algorithm, but does not replace the
existing one. There are cases, such as packing a single snapshot of a
repository, where the existing algorithm outperforms the new one.
However, my findings show that when a repository has many versions of files
at the same path (and especially when there are many name-hash collisions)
then there are significant gains to be made using the new algorithm.
Repo Standard Repack With --full-name-hash fluentui 438 MB 168 MB Repo B
6,255 MB 829 MB Repo C 37,737 MB 7,125 MB Repo D 130,049 MB 6,190 MB
The main change in this series is in patch 1, which adds the algorithm and
the option to 'git pack-objects' and 'git repack'. The remaining patches are
focused on creating more evidence around the value of the new name-hash
algorithm and its effects on the packfiles created with it.
I will also try to make clear that I've been focused on client-side
performance and size concerns. I do not know if using this option will have
issues with advanced server-side repacking features, such as delta islands,
reachability bitmaps, or serving clones and fetches from the resulting
packfile. My educated guess is that the name-hash value does not affect
these features in any direct way, but I'll leave the testing of the server
scenarios to the experts.
Thanks, -Stolee
Derrick Stolee (4):
pack-objects: add --full-name-hash option
git-repack: update usage to match docs
p5313: add size comparison test
p5314: add a size test for name-hash collisions
Documentation/git-pack-objects.txt | 3 +-
Documentation/git-repack.txt | 4 +-
Makefile | 1 +
builtin/pack-objects.c | 20 ++++++---
builtin/repack.c | 9 +++-
pack-objects.h | 20 +++++++++
t/helper/test-name-hash.c | 23 ++++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/perf/p5313-pack-objects.sh | 71 ++++++++++++++++++++++++++++++
t/perf/p5314-name-hash.sh | 41 +++++++++++++++++
t/t0450/txt-help-mismatches | 1 -
12 files changed, 186 insertions(+), 9 deletions(-)
create mode 100644 t/helper/test-name-hash.c
create mode 100755 t/perf/p5313-pack-objects.sh
create mode 100755 t/perf/p5314-name-hash.sh
base-commit: 4c42d5ff284067fa32837421408bebfef996bf81
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1785%2Fderrickstolee%2Ffull-name-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1785/derrickstolee/full-name-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1785
--
gitgitgadget
next reply other threads:[~2024-09-09 13:56 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-09 13:56 Derrick Stolee via GitGitGadget [this message]
2024-09-09 13:56 ` [PATCH 1/4] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-09 17:45 ` Junio C Hamano
2024-09-10 2:31 ` Derrick Stolee
2024-09-09 13:56 ` [PATCH 2/4] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-09 17:48 ` Junio C Hamano
2024-09-09 13:56 ` [PATCH 3/4] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` [PATCH 4/4] p5314: add a size test for name-hash collisions Derrick Stolee via GitGitGadget
2024-09-09 16:07 ` [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee
2024-09-09 18:06 ` Junio C Hamano
2024-09-10 2:37 ` Derrick Stolee
2024-09-10 14:56 ` Taylor Blau
2024-09-10 21:05 ` Derrick Stolee
2024-09-11 6:32 ` Jeff King
2024-09-10 16:07 ` Junio C Hamano
2024-09-10 20:36 ` Junio C Hamano
2024-09-10 21:09 ` Derrick Stolee
2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
2024-09-18 20:46 ` [PATCH v2 1/6] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-19 21:56 ` Junio C Hamano
2024-09-18 20:46 ` [PATCH v2 2/6] repack: test " Derrick Stolee via GitGitGadget
2024-09-19 22:11 ` Junio C Hamano
2024-09-18 20:46 ` [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
2024-09-19 22:22 ` Junio C Hamano
2024-09-23 1:39 ` Derrick Stolee
2024-09-18 20:46 ` [PATCH v2 4/6] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-18 20:46 ` [PATCH v2 5/6] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-19 21:58 ` Junio C Hamano
2024-09-23 1:50 ` Derrick Stolee
2024-09-23 16:14 ` Junio C Hamano
2024-09-18 20:46 ` [PATCH v2 6/6] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-09-24 7:02 ` Patrick Steinhardt
2024-09-25 13:35 ` Derrick Stolee
2024-09-18 23:30 ` [PATCH v2 0/6] pack-objects: create new name-hash algorithm Derrick Stolee
2024-10-04 21:17 ` Junio C Hamano
2024-10-04 22:30 ` Taylor Blau
2024-10-04 22:35 ` Jeff King
2024-10-08 18:32 ` Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=pull.1785.git.1725890210.gitgitgadget@gmail.com \
--to=gitgitgadget@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=johannes.schindelin@gmx.de \
--cc=johncai86@gmail.com \
--cc=me@ttaylorr.com \
--cc=newren@gmail.com \
--cc=peff@peff.net \
--cc=ps@pks.im \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).