From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, jrnieder@gmail.com,
Derrick Stolee <derrickstolee@github.com>
Subject: Re: [PATCH 02/30] read-cache: add index.computeHash config option
Date: Thu, 17 Nov 2022 17:13:45 +0100 [thread overview]
Message-ID: <221117.8635ahik7e.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <030d76f52af654470026b0c4b1dfba2b6c996885.1667846164.git.gitgitgadget@gmail.com>
On Mon, Nov 07 2022, Derrick Stolee via GitGitGadget wrote:
> Summary
> 'without hash' ran
> 1.78 ± 0.76 times faster than 'with hash'
>
> These performance benefits are substantial enough to allow users the
> ability to opt-in to this feature, even with the potential confusion
> with older 'git fsck' versions.
The 0.76 part of that is probably just fs caches etc. screwing things
up. I tried it on a ramdisk with CFLAGS=-O3:
$ hyperfine -L v false,true './git -c index.computeHash={v} -C /dev/shm/linux update-index --force-write' -w 1 -r 10
Benchmark 1: ./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write
Time (mean ± σ): 13.3 ms ± 0.3 ms [User: 7.1 ms, System: 6.1 ms]
Range (min … max): 12.7 ms … 13.6 ms 10 runs
Benchmark 2: ./git -c index.computeHash=true -C /dev/shm/linux update-index --force-write
Time (mean ± σ): 34.8 ms ± 0.4 ms [User: 28.9 ms, System: 5.8 ms]
Range (min … max): 34.2 ms … 35.1 ms 10 runs
Summary
'./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write' ran
2.62 ± 0.07 times faster than './git -c index.computeHash=true -C /dev/shm/linux update-index --force-write'
I also see that if I compile with OPENSSL_SHA1=Y, then:
$ hyperfine -L v false,true './git -c index.computeHash={v} -C /dev/shm/linux update-index --force-write'
Benchmark 1: ./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write
Time (mean ± σ): 14.0 ms ± 1.3 ms [User: 7.7 ms, System: 6.2 ms]
Range (min … max): 13.1 ms … 21.7 ms 206 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare'
options.
Benchmark 2: ./git -c index.computeHash=true -C /dev/shm/linux update-index --force-write
Time (mean ± σ): 21.0 ms ± 1.0 ms [User: 15.0 ms, System: 6.0 ms]
Range (min … max): 20.1 ms … 28.4 ms 138 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare'
options.
Summary
'./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write' ran
1.50 ± 0.15 times faster than './git -c index.computeHash=true -C /dev/shm/linux update-index --force-write'
Which, FWIW is something worth considering. I.e. when we introduced
sha1dc we did so with the "big hammer" of the existing hashing API,
which is all or nothing, and we pick the hash when we compile git.
But that left a lot of things slower for no good reason, e.g. when we do
this hashing of the trailers. So if we could just compile with two
implementations, and give users the choice of "use the faster hash when
you're not communicating with other git repos" we could make things
faster in some cases, without the potential format interop issues.
> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> +index.computeHash::
> + When enabled, compute the hash of the index file as it is written
> + and store the hash at the end of the content. This is enabled by
> + default.
> ++
If we have a boolean option it makes sense to make its name reflect the
opt-in nature. So "index.skipHash". Then just say "If enabled", and skip
the "this is enabled by default, and then later this code:
> + int compute_hash;
> [...]
> + if (!git_config_get_maybe_bool("index.computehash", &compute_hash) &&
> + !compute_hash)
> + f->skip_hash = 1;
Can just become:
git_config_get_maybe_bool("index.skipHash", &f->skip_hash);
I.e. git_config_get_maybe_bool() leaves the passed-in dest value alone
if it doesn't have it in the config, and you only use this
"compute_hash" as an inverted version of "skip_hash".
> +If you disable `index.computHash`, then older Git clients may report that
> +your index is corrupt during `git fsck`.
> diff --git a/read-cache.c b/read-cache.c
> index 32024029274..f24d96de4d3 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
> git_hash_ctx c;
> unsigned char hash[GIT_MAX_RAWSZ];
> int hdr_version;
> + int all_zeroes = 1;
> + unsigned char *start, *end;
>
> if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
> return error(_("bad signature 0x%08x"), hdr->hdr_signature);
> @@ -1827,10 +1829,23 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
> if (!verify_index_checksum)
> return 0;
>
> + end = (unsigned char *)hdr + size;
> + start = end - the_hash_algo->rawsz;
> + while (start < end) {
> + if (*start != 0) {
> + all_zeroes = 0;
> + break;
> + }
> + start++;
> + }
Didn't you just re-invent oidread()? :)
Just to narrate my way through this. Before we called verify_hdr() we
did:
hdr = (const struct cache_header *)mmap;
if (verify_hdr(hdr, mmap_size) < 0)
So, we mmap()'d the index on disk, and whe "hdr" is the struct version
of this data, we then cast that back to an "unsigned char *" here,
because we're interested in just the raw bytes.
Then we "jump to the end" here, and start iterating over the rawsz at
the end, because we're just reading if we have a null_oid().
Then, right after that verify_hdr() call, the veriy next thing we'll do is:
oidread(&istate->oid, (const unsigned char *)hdr + mmap_size - the_hash_algo->rawsz);
So, maybe I'm missing some subtlety still, and some of this is existing
baggage in the pre-image (we used to have the sha1 in the struct, a
*long* time ago).
But isn't this equivalent?:
diff --git a/read-cache.c b/read-cache.c
index f24d96de4d3..39b5b8419f5 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1812,13 +1812,14 @@ int verify_index_checksum;
/* Allow fsck to force verification of the cache entry order. */
int verify_ce_order;
-static int verify_hdr(const struct cache_header *hdr, unsigned long size)
+static int verify_hdr(const char *const mmap, const size_t size,
+ const struct cache_header **hdrp, struct object_id *oid)
{
+ const struct cache_header *hdr = (const struct cache_header *)mmap;
git_hash_ctx c;
unsigned char hash[GIT_MAX_RAWSZ];
int hdr_version;
- int all_zeroes = 1;
- unsigned char *start, *end;
+ const unsigned char *end = (unsigned char *)mmap + size;
if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
return error(_("bad signature 0x%08x"), hdr->hdr_signature);
@@ -1826,20 +1827,12 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version)
return error(_("bad index version %d"), hdr_version);
+ *hdrp = hdr;
+ oidread(oid, end - the_hash_algo->rawsz);
+
if (!verify_index_checksum)
return 0;
-
- end = (unsigned char *)hdr + size;
- start = end - the_hash_algo->rawsz;
- while (start < end) {
- if (*start != 0) {
- all_zeroes = 0;
- break;
- }
- start++;
- }
-
- if (all_zeroes)
+ if (is_null_oid(oid))
return 0;
the_hash_algo->init_fn(&c);
@@ -2358,11 +2351,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
mmap_os_err());
close(fd);
- hdr = (const struct cache_header *)mmap;
- if (verify_hdr(hdr, mmap_size) < 0)
+ if (verify_hdr(mmap, mmap_size, &hdr, &istate->oid) < 0)
goto unmap;
-
- oidread(&istate->oid, (const unsigned char *)hdr + mmap_size - the_hash_algo->rawsz);
istate->version = ntohl(hdr->hdr_version);
istate->cache_nr = ntohl(hdr->hdr_entries);
istate->cache_alloc = alloc_nr(istate->cache_nr);
I.e. we just make the verify function be in charge of populating our
"oid", which we can do that early, as we'd error out later in the
function if it doesn't match.
We could avoid the "hdrp" there, but if we're doing the cast it's
probably good for readability to just do it once.
> +test_expect_success 'index.computeHash config option' '
> + (
> + rm -f .git/index &&
> + git -c index.computeHash=false add a &&
> + git fsck
> + )
> +'
You can skip the subshell here, but for a non-RFC let's leave the test
in a nice state for the next test someone adds, so maybe:
test_when_finished "rm -rf repo" &&
git clone . repo &&
[...]
Lastly, on this again:
> These performance benefits are substantial enough to allow users the
> ability to opt-in to this feature, even with the potential confusion
> with older 'git fsck' versions.
Isn't an unstated major caveat here that it's not "an older verison",
but if you on *your version* set the config to "true" your index doesn't
have a hash, so it's persisted until you wipe the index?
next prev parent reply other threads:[~2022-11-17 17:07 UTC|newest]
Thread overview: 56+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 01/30] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 02/30] read-cache: add index.computeHash config option Derrick Stolee via GitGitGadget
2022-11-11 23:31 ` Elijah Newren
2022-11-14 16:30 ` Derrick Stolee
2022-11-17 16:13 ` Ævar Arnfjörð Bjarmason [this message]
2022-11-07 18:35 ` [PATCH 03/30] extensions: add refFormat extension Derrick Stolee via GitGitGadget
2022-11-11 23:39 ` Elijah Newren
2022-11-16 14:37 ` Derrick Stolee
2022-11-07 18:35 ` [PATCH 04/30] config: fix multi-level bulleted list Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 05/30] repository: wire ref extensions to ref backends Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 06/30] refs: allow loose files without packed-refs Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 07/30] chunk-format: number of chunks is optional Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 08/30] chunk-format: document trailing table of contents Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 09/30] chunk-format: store chunk offset during write Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 10/30] chunk-format: allow trailing table of contents Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 11/30] chunk-format: parse " Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 12/30] refs: extract packfile format to new file Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 13/30] packed-backend: extract add_write_error() Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 14/30] packed-backend: extract iterator/updates merge Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 15/30] packed-backend: create abstraction for writing refs Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 16/30] config: add config values for packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 17/30] packed-backend: create shell of v2 writes Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 18/30] packed-refs: write file format version 2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 19/30] packed-refs: read file format v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 20/30] packed-refs: read optional prefix chunks Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 21/30] packed-refs: write " Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 22/30] packed-backend: create GIT_TEST_PACKED_REFS_VERSION Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 23/30] t1409: test with packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 24/30] t5312: allow packed-refs v2 format Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 25/30] t5502: add PACKED_REFS_V1 prerequisite Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 26/30] t3210: require packed-refs v1 for some tests Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 27/30] t*: skip packed-refs v2 over http tests Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 28/30] ci: run GIT_TEST_PACKED_REFS_VERSION=2 in some builds Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 29/30] p1401: create performance test for ref operations Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 30/30] refs: skip hashing when writing packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-09 15:15 ` [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee
2022-11-11 23:28 ` Elijah Newren
2022-11-14 0:07 ` Derrick Stolee
2022-11-15 2:47 ` Elijah Newren
2022-11-16 14:45 ` Derrick Stolee
2022-11-17 4:28 ` Elijah Newren
2022-11-18 23:31 ` Junio C Hamano
2022-11-19 0:41 ` Elijah Newren
2022-11-19 3:00 ` Taylor Blau
2022-11-30 15:31 ` Derrick Stolee
2022-11-28 18:56 ` Han-Wen Nienhuys
2022-11-30 15:16 ` Derrick Stolee
2022-11-30 15:38 ` Phillip Wood
2022-11-30 16:37 ` Taylor Blau
2022-11-30 18:30 ` Han-Wen Nienhuys
2022-11-30 18:37 ` Sean Allred
2022-12-01 20:18 ` Derrick Stolee
2022-12-02 16:46 ` Han-Wen Nienhuys
2022-12-02 18:24 ` Ævar Arnfjörð Bjarmason
2022-11-30 22:55 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=221117.8635ahik7e.gmgdl@evledraar.gmail.com \
--to=avarab@gmail.com \
--cc=derrickstolee@github.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=jrnieder@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).