All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, gitster@pobox.com, vdye@github.com,
	newren@gmail.com, Derrick Stolee <derrickstolee@github.com>
Subject: Re: [PATCH 1/4] hashfile: allow skipping the hash function
Date: Wed, 07 Dec 2022 23:13:15 +0100	[thread overview]
Message-ID: <221207.868rjiam86.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <40ee8dbaef06f8f4265d12436455279499d7ac01.1670433958.git.gitgitgadget@gmail.com>


On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> However, hashing the file contents during write comes at a performance
> penalty. It's slower to hash the bytes on their way to the disk than
> without that step. This problem is made worse by the replacement of
> hardware-accelerated SHA1 computations with the software-based sha1dc
> computation.

More on that lack of HW accel later...

> This write cost is significant

Don't you mean hashing cost, or do we also do additional writes if we do
the hashing?

> , and the checksum capability is likely
> not worth that cost for such a short-lived file. The index is rewritten
> frequently and the only time the checksum is checked is during 'git
> fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
> computation.

I didn't know that, and had assumed that we at least checked it on the
full read (and I found this bit of the commit message after writing the
last paragraphs here at the end, so maybe skipping this is fine...).

> [...]
> @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
>  	int fd;
>  
>  	hashflush(f);
> -	the_hash_algo->final_fn(f->buffer, &f->ctx);
> +
> +	if (f->skip_hash)
> +		memset(f->buffer, 0, the_hash_algo->rawsz);

Here you're hardcoding a new version of null_oid(), but we can use it
instead. Perhaps:
	
	diff --git a/csum-file.c b/csum-file.c
	index 3243473c3d7..b54c4f66cbb 100644
	--- a/csum-file.c
	+++ b/csum-file.c
	@@ -63,11 +63,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
	 		      enum fsync_component component, unsigned int flags)
	 {
	 	int fd;
	+	const struct object_id *const noid = null_oid();
	 
	 	hashflush(f);
	 
	 	if (f->skip_hash)
	-		memset(f->buffer, 0, the_hash_algo->rawsz);
	+		memcpy(f->buffer, noid, sizeof(*noid));
	 	else
	 		the_hash_algo->final_fn(f->buffer, &f->ctx);
	 
> @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
>  	f->tp = tp;
>  	f->name = name;
>  	f->do_crc = 0;
> +	f->skip_hash = 0;
>  	the_hash_algo->init_fn(&f->ctx);
>  
>  	f->buffer_len = buffer_len;

I think I pointed out in the RFC that we'd be much faster with
non-sha1collisiondetection, and that maybe this would get us partway to
the performance you desired (or maybe we'd think that was a more
acceptable trade-off, as it didn't make the format
backwards-incompatible).

But just from seeing "do_crc" here in the context, did you benchmark it
against that? How does it perform?

There's no place to put that in the index, but we *could* encode it in
the hash, just with a lot of leading zeros.

Maybe that would give us some/most of the performance benefits, with the
benefit of a checksum?

Or maybe not, but I think it's worth exploring & supporting a different
& faster SHA-1 implementation before making (even opt-in) backwards
incompatible format changes for performance reasons, and if even that's
too slow maybe crc32 would be sufficient (but not compatible), but still
safer?

  reply	other threads:[~2022-12-07 22:21 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-07 22:13   ` Ævar Arnfjörð Bjarmason [this message]
2022-12-08  7:32     ` Jeff King
2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-07 18:59   ` Eric Sunshine
2022-12-12 13:59     ` Derrick Stolee
2022-12-12 18:55       ` Eric Sunshine
2022-12-07 22:25   ` Ævar Arnfjörð Bjarmason
2022-12-07 23:06   ` Ævar Arnfjörð Bjarmason
2022-12-08  0:05     ` Junio C Hamano
2022-12-12 14:05     ` Derrick Stolee
2022-12-12 18:01       ` Ævar Arnfjörð Bjarmason
2022-12-07 17:25 ` [PATCH 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-07 22:27   ` Ævar Arnfjörð Bjarmason
2022-12-12 14:10     ` Derrick Stolee
2022-12-07 17:25 ` [PATCH 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-07 22:30   ` Ævar Arnfjörð Bjarmason
2022-12-12 14:18     ` Derrick Stolee
2022-12-12 18:27       ` Ævar Arnfjörð Bjarmason
2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano
2022-12-07 23:42   ` Ævar Arnfjörð Bjarmason
2022-12-08 16:38   ` Derrick Stolee
2022-12-12 22:22     ` Jacob Keller
2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2022-12-12 16:31   ` [PATCH v2 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-12 16:31   ` [PATCH v2 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-12 16:31   ` [PATCH v2 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-12 18:14     ` SZEDER Gábor
2022-12-13  0:55       ` Junio C Hamano
2022-12-17 17:37         ` SZEDER Gábor
2022-12-12 16:31   ` [PATCH v2 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
2022-12-15 15:06     ` [PATCH v3 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-15 15:06     ` [PATCH v3 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-15 16:12       ` Ævar Arnfjörð Bjarmason
2022-12-15 15:06     ` [PATCH v3 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-15 15:07     ` [PATCH v3 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-15 15:56     ` [PATCH v3 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
2022-12-16 13:41       ` Derrick Stolee
2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-16 15:43       ` [PATCH v4 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
2023-01-06 15:33         ` Derrick Stolee
2023-01-06 22:45           ` Junio C Hamano
2023-01-06 23:40             ` Derrick Stolee
2023-01-09 17:15               ` Ævar Arnfjörð Bjarmason
2023-01-09 18:00                 ` Derrick Stolee
2023-01-09 19:22                   ` Ævar Arnfjörð Bjarmason
2023-01-06 16:31       ` [PATCH v5 " Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2023-01-15  9:31         ` [PATCH v5 0/4] Optionally skip hashing index on write Junio C Hamano
2023-01-17 14:49           ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=221207.868rjiam86.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=newren@gmail.com \
    --cc=vdye@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.