[PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
@ 2024-11-20 19:13 Taylor Blau
  2024-11-20 19:13 ` [PATCH 1/6] csum-file: store the hash algorithm as a struct field Taylor Blau
                   ` (8 more replies)
  0 siblings, 9 replies; 60+ messages in thread
From: Taylor Blau @ 2024-11-20 19:13 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Jeff King, Junio C Hamano, brian m. carlson

(This series is based on my 'tb/sha1-unsafe-helper', which I sent to the
list here[1].)

This series implements an idea discussed in [2] which suggests that we
introduce a way to access a wrapped version of a 'struct git_hash_algo'
which represents the unsafe variant of that algorithm, rather than
having individual unsafe_ functions (like unsafe_init_fn() versus
init_fn(), etc.).

This approach is relatively straightforward to implement, and removes a
significant deficiency in the original implementation of
unsafe/non-cryptographic hash functions by making it impossible to
switch between safe- and unsafe variants of hash functions. It also
cleans up the sha1-unsafe test helper's implementation by removing a
large number of "if (unsafe)"-style conditionals.

The series is laid out as follows:

  * The first two patches prepare the hashfile API for the upcoming
    change:

      csum-file: store the hash algorithm as a struct field
      csum-file.c: extract algop from hashfile_checksum_valid()

  * The next patch implements the new 'unsafe_hash_algo()' function at
    the heart of this series' approach:

      hash.h: introduce `unsafe_hash_algo()`

  * The next two patches convert existing callers to use the new
    'unsafe_hash_algo()' function, instead of switching between safe and
    unsafe_ variants of individual functions:

      csum-file.c: use unsafe_hash_algo()
      t/helper/test-hash.c: use unsafe_hash_algo()

  * The final patch drops the unsafe_ function variants following all
    callers being converted to use the new pattern:

      hash.h: drop unsafe_ function variants

Thanks in advance for your review!

[1]: https://lore.kernel.org/git/cover.1730833506.git.me@ttaylorr.com/
[2]: https://lore.kernel.org/git/20241107013915.GA961214@coredump.intra.peff.net/

Taylor Blau (6):
  csum-file: store the hash algorithm as a struct field
  csum-file.c: extract algop from hashfile_checksum_valid()
  hash.h: introduce `unsafe_hash_algo()`
  csum-file.c: use unsafe_hash_algo()
  t/helper/test-hash.c: use unsafe_hash_algo()
  hash.h: drop unsafe_ function variants

 csum-file.c          | 33 ++++++++++++++++++---------------
 csum-file.h          |  1 +
 hash.h               | 20 +++++---------------
 object-file.c        | 41 ++++++++++++++++++++++++++---------------
 t/helper/test-hash.c | 17 +++++------------
 5 files changed, 55 insertions(+), 57 deletions(-)


base-commit: d8c1fc78b57e02a140b5c363caaa14c2dc2bb274
-- 
2.47.0.237.gc601277f4c4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 1/6] csum-file: store the hash algorithm as a struct field
  2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
@ 2024-11-20 19:13 ` Taylor Blau
  2024-11-21  9:18   ` Jeff King
  2024-11-20 19:13 ` [PATCH 2/6] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 60+ messages in thread
From: Taylor Blau @ 2024-11-20 19:13 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Jeff King, Junio C Hamano, brian m. carlson

Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
and call its _usnafe function variants directly.

Prepare for a future change where we may use a different 'git_hash_algo'
pointer (instead of just relying on 'the_hash_algo' throughout) by
making the 'git_hash_algo' pointer a member of the 'hashfile' structure
itself.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 20 +++++++++++---------
 csum-file.h |  1 +
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index c203ebf11b3..62f4965f094 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -50,7 +50,7 @@ void hashflush(struct hashfile *f)
 
 	if (offset) {
 		if (!f->skip_hash)
-			the_hash_algo->unsafe_update_fn(&f->ctx, f->buffer, offset);
+			f->algop->unsafe_update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -71,14 +71,14 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	hashflush(f);
 
 	if (f->skip_hash)
-		hashclr(f->buffer, the_repository->hash_algo);
+		hashclr(f->buffer, f->algop);
 	else
-		the_hash_algo->unsafe_final_fn(f->buffer, &f->ctx);
+		f->algop->unsafe_final_fn(f->buffer, &f->ctx);
 
 	if (result)
-		hashcpy(result, f->buffer, the_repository->hash_algo);
+		hashcpy(result, f->buffer, f->algop);
 	if (flags & CSUM_HASH_IN_STREAM)
-		flush(f, f->buffer, the_hash_algo->rawsz);
+		flush(f, f->buffer, f->algop->rawsz);
 	if (flags & CSUM_FSYNC)
 		fsync_component_or_die(component, f->fd, f->name);
 	if (flags & CSUM_CLOSE) {
@@ -128,7 +128,7 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * f->offset is necessarily zero.
 			 */
 			if (!f->skip_hash)
-				the_hash_algo->unsafe_update_fn(&f->ctx, buf, nr);
+				f->algop->unsafe_update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -174,7 +174,9 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->name = name;
 	f->do_crc = 0;
 	f->skip_hash = 0;
-	the_hash_algo->unsafe_init_fn(&f->ctx);
+
+	f->algop = the_hash_algo;
+	f->algop->unsafe_init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
 	f->buffer = xmalloc(buffer_len);
@@ -208,7 +210,7 @@ void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpo
 {
 	hashflush(f);
 	checkpoint->offset = f->total;
-	the_hash_algo->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
+	f->algop->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
 }
 
 int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
@@ -219,7 +221,7 @@ int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint
 	    lseek(f->fd, offset, SEEK_SET) != offset)
 		return -1;
 	f->total = offset;
-	the_hash_algo->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
+	f->algop->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
 	f->offset = 0; /* hashflush() was called in checkpoint */
 	return 0;
 }
diff --git a/csum-file.h b/csum-file.h
index 7c73da0a40a..2b45f4673a2 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,7 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+	const struct git_hash_algo *algop;
 
 	/**
 	 * If non-zero, skip_hash indicates that we should
-- 
2.47.0.237.gc601277f4c4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/6] csum-file: store the hash algorithm as a struct field
  2024-11-20 19:13 ` [PATCH 1/6] csum-file: store the hash algorithm as a struct field Taylor Blau
@ 2024-11-21  9:18   ` Jeff King
  0 siblings, 0 replies; 60+ messages in thread
From: Jeff King @ 2024-11-21  9:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Elijah Newren, Junio C Hamano, brian m. carlson

On Wed, Nov 20, 2024 at 02:13:44PM -0500, Taylor Blau wrote:

> Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
> and call its _usnafe function variants directly.

s/usnafe/unsafe/

> Prepare for a future change where we may use a different 'git_hash_algo'
> pointer (instead of just relying on 'the_hash_algo' throughout) by
> making the 'git_hash_algo' pointer a member of the 'hashfile' structure
> itself.

This makes sense in the context of your series, which will eventually
just hold the unsafe algo struct.

But I also think it is a good step in general, because it means that
rather than referring to the_hash_algo all over the place, we only do it
when initializing the hashfile struct. In the long run, that should take
a repository struct or a git_hash_algo pointer directly, and we'd want
to pass it just once to hashfd(), etc.

-Peff

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 2/6] csum-file.c: extract algop from hashfile_checksum_valid()
  2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
  2024-11-20 19:13 ` [PATCH 1/6] csum-file: store the hash algorithm as a struct field Taylor Blau
@ 2024-11-20 19:13 ` Taylor Blau
  2024-11-20 19:13 ` [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2024-11-20 19:13 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Jeff King, Junio C Hamano, brian m. carlson

Perform a similar transformation as in the previous commit, but focused
instead on hashfile_checksum_valid(). This function does not work with a
hashfile structure itself, and instead validates the raw contents of a
file written using the hashfile API.

We'll want to be prepared for a similar change to this function in the
future, so prepare ourselves for that by extracting 'the_hash_algo' into
its own field for use within this function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 62f4965f094..107bc4c96f9 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -242,14 +242,15 @@ int hashfile_checksum_valid(const unsigned char *data, size_t total_len)
 {
 	unsigned char got[GIT_MAX_RAWSZ];
 	git_hash_ctx ctx;
-	size_t data_len = total_len - the_hash_algo->rawsz;
+	const struct git_hash_algo *algop = the_hash_algo;
+	size_t data_len = total_len - algop->rawsz;
 
-	if (total_len < the_hash_algo->rawsz)
+	if (total_len < algop->rawsz)
 		return 0; /* say "too short"? */
 
-	the_hash_algo->unsafe_init_fn(&ctx);
-	the_hash_algo->unsafe_update_fn(&ctx, data, data_len);
-	the_hash_algo->unsafe_final_fn(got, &ctx);
+	algop->unsafe_init_fn(&ctx);
+	algop->unsafe_update_fn(&ctx, data, data_len);
+	algop->unsafe_final_fn(got, &ctx);
 
-	return hasheq(got, data + data_len, the_repository->hash_algo);
+	return hasheq(got, data + data_len, algop);
 }
-- 
2.47.0.237.gc601277f4c4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()`
  2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
  2024-11-20 19:13 ` [PATCH 1/6] csum-file: store the hash algorithm as a struct field Taylor Blau
  2024-11-20 19:13 ` [PATCH 2/6] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
@ 2024-11-20 19:13 ` Taylor Blau
  2024-11-21  9:37   ` Jeff King
  2024-11-20 19:13 ` [PATCH 4/6] csum-file.c: use unsafe_hash_algo() Taylor Blau
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 60+ messages in thread
From: Taylor Blau @ 2024-11-20 19:13 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Jeff King, Junio C Hamano, brian m. carlson

In 253ed9ecff (hash.h: scaffolding for _unsafe hashing variants,
2024-09-26), we introduced "unsafe" variants of the SHA-1 hashing
functions by introducing new functions like "unsafe_init_fn()" and so
on.

This approach has a major shortcoming that callers must remember to
consistently use one variant or the other. Failing to consistently use
(or not use) the unsafe variants can lead to crashes at best, or subtle
memory corruption issues at worst.

In the hashfile API, this isn't difficult to achieve, but verifying that
all callers consistently use the unsafe variants is somewhat of a chore
given how spread out all of the callers are. In the sha1 and sha1-unsafe
test helpers, all of the calls to various hash functions are guarded by
an "if (unsafe)" conditional, which is repetitive and cumbersome.

Address these issues by introducing a new pattern whereby one
'git_hash_algo' can return a pointer to another 'git_hash_algo' that
represents the unsafe version of itself. So instead of having something
like:

    if (unsafe)
      the_hash_algo->init_fn(...);
      the_hash_algo->update_fn(...);
      the_hash_algo->final_fn(...);
    else
      the_hash_algo->unsafe_init_fn(...);
      the_hash_algo->unsafe_update_fn(...);
      the_hash_algo->unsafe_final_fn(...);

we can instead write:

    struct git_hash_algo *algop = the_hash_algo;
    if (unsafe)
      algop = unsafe_hash_algo(algop);

    the_hash_algo->init_fn(...);
    the_hash_algo->update_fn(...);
    the_hash_algo->final_fn(...);

This removes the existing shortcoming by no longer forcing the caller to
"remember" which variant of the hash functions it wants to call, only to
hold onto a 'struct git_hash_algo' pointer that is initialized once.

Similarly, while there currently is still a way to "mix" safe and unsafe
functions, this too will go away after subsequent commits remove all
direct calls to the unsafe_ variants.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 hash.h        |  5 +++++
 object-file.c | 26 ++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/hash.h b/hash.h
index 756166ce5e8..23cf6876e50 100644
--- a/hash.h
+++ b/hash.h
@@ -305,6 +305,9 @@ struct git_hash_algo {
 
 	/* The all-zeros OID. */
 	const struct object_id *null_oid;
+
+	/* The unsafe variant of this hash function, if one exists. */
+	const struct git_hash_algo *unsafe;
 };
 extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
 
@@ -323,6 +326,8 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
 	return p - hash_algos;
 }
 
+const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop);
+
 const struct object_id *null_oid(void);
 
 static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2, const struct git_hash_algo *algop)
diff --git a/object-file.c b/object-file.c
index b1a3463852c..fddcdbe9ba6 100644
--- a/object-file.c
+++ b/object-file.c
@@ -204,6 +204,22 @@ static void git_hash_unknown_final_oid(struct object_id *oid UNUSED,
 	BUG("trying to finalize unknown hash");
 }
 
+static const struct git_hash_algo sha1_unsafe_algo = {
+	.name = "sha1",
+	.format_id = GIT_SHA1_FORMAT_ID,
+	.rawsz = GIT_SHA1_RAWSZ,
+	.hexsz = GIT_SHA1_HEXSZ,
+	.blksz = GIT_SHA1_BLKSZ,
+	.init_fn = git_hash_sha1_init_unsafe,
+	.clone_fn = git_hash_sha1_clone_unsafe,
+	.update_fn = git_hash_sha1_update_unsafe,
+	.final_fn = git_hash_sha1_final_unsafe,
+	.final_oid_fn = git_hash_sha1_final_oid_unsafe,
+	.empty_tree = &empty_tree_oid,
+	.empty_blob = &empty_blob_oid,
+	.null_oid = &null_oid_sha1,
+};
+
 const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 	{
 		.name = NULL,
@@ -241,6 +257,7 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.unsafe_update_fn = git_hash_sha1_update_unsafe,
 		.unsafe_final_fn = git_hash_sha1_final_unsafe,
 		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
+		.unsafe = &sha1_unsafe_algo,
 		.empty_tree = &empty_tree_oid,
 		.empty_blob = &empty_blob_oid,
 		.null_oid = &null_oid_sha1,
@@ -307,6 +324,15 @@ int hash_algo_by_length(int len)
 	return GIT_HASH_UNKNOWN;
 }
 
+const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop)
+{
+	/* If we have a faster "unsafe" implementation, use that. */
+	if (algop->unsafe)
+		return algop->unsafe;
+	/* Otherwise use the default one. */
+	return algop;
+}
+
 /*
  * This is meant to hold a *small* number of objects that you would
  * want repo_read_object_file() to be able to return, but yet you do not want
-- 
2.47.0.237.gc601277f4c4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()`
  2024-11-20 19:13 ` [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
@ 2024-11-21  9:37   ` Jeff King
  2024-11-22  0:39     ` brian m. carlson
  2025-01-10 21:38     ` Taylor Blau
  0 siblings, 2 replies; 60+ messages in thread
From: Jeff King @ 2024-11-21  9:37 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Elijah Newren, Junio C Hamano, brian m. carlson

On Wed, Nov 20, 2024 at 02:13:50PM -0500, Taylor Blau wrote:

> +static const struct git_hash_algo sha1_unsafe_algo = {
> +	.name = "sha1",
> +	.format_id = GIT_SHA1_FORMAT_ID,
> +	.rawsz = GIT_SHA1_RAWSZ,
> +	.hexsz = GIT_SHA1_HEXSZ,
> +	.blksz = GIT_SHA1_BLKSZ,
> +	.init_fn = git_hash_sha1_init_unsafe,
> +	.clone_fn = git_hash_sha1_clone_unsafe,
> +	.update_fn = git_hash_sha1_update_unsafe,
> +	.final_fn = git_hash_sha1_final_unsafe,
> +	.final_oid_fn = git_hash_sha1_final_oid_unsafe,
> +	.empty_tree = &empty_tree_oid,
> +	.empty_blob = &empty_blob_oid,
> +	.null_oid = &null_oid_sha1,
> +};

All of the non-function fields here naturally must match the ones in the
parent algo struct, or chaos ensues. That's a little fragile, but it's
not like we're adding new algorithm variants a lot. The biggest risk, I
guess, would be adding a new field to git_hash_algo which defaults to
zero-initialization here. But again, there are only three total and they
are defined near each other here, so I don't think it's a big risk
overall.

I think this struct is a potential landmine for hash_algo_by_ptr():

  static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
  {
          return p - hash_algos;
  }

It's undefined behavior to pass in sha1_unsafe_algo to this function
(but the compiler would not complain since the types are the same). I
don't find it incredibly likely that somebody would want to do that on
an unsafe variant, but I'm not thrilled about leaving that wart for
somebody to find.

If we don't care about the speed of this function, then an
implementation like:

  for (i = 0; i < GIT_HASH_NALGOS; i++) {
	if (p == &hash_algos[i] || p == hash_algos[i]->unsafe)
		return i;
  }
  return GIT_HASH_UNKNOWN;

would work. I'm not sure if that would be measurable. I was surprised at
the number of places that hash_algo_by_ptr() is called. Many low-level
oid functions need it because we store the integer id there rather than
a direct pointer (so oidread(), oidclr(), oid_object_info_extended(),
and so on). But I'd also expect the loop above to be pretty fast. So I
dunno.

>  const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
>  	{
>  		.name = NULL,
> @@ -241,6 +257,7 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
>  		.unsafe_update_fn = git_hash_sha1_update_unsafe,
>  		.unsafe_final_fn = git_hash_sha1_final_unsafe,
>  		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
> +		.unsafe = &sha1_unsafe_algo,
>  		.empty_tree = &empty_tree_oid,
>  		.empty_blob = &empty_blob_oid,
>  		.null_oid = &null_oid_sha1,

OK, and we can leave the sha256 pointer zero-initialized, since the
function handles that at runtime. Good.

-Peff

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()`
  2024-11-21  9:37   ` Jeff King
@ 2024-11-22  0:39     ` brian m. carlson
  2024-11-22  8:25       ` Jeff King
  2025-01-10 21:38     ` Taylor Blau
  1 sibling, 1 reply; 60+ messages in thread
From: brian m. carlson @ 2024-11-22  0:39 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, Elijah Newren, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 3624 bytes --]

On 2024-11-21 at 09:37:31, Jeff King wrote:
> On Wed, Nov 20, 2024 at 02:13:50PM -0500, Taylor Blau wrote:
> 
> > +static const struct git_hash_algo sha1_unsafe_algo = {
> > +	.name = "sha1",
> > +	.format_id = GIT_SHA1_FORMAT_ID,
> > +	.rawsz = GIT_SHA1_RAWSZ,
> > +	.hexsz = GIT_SHA1_HEXSZ,
> > +	.blksz = GIT_SHA1_BLKSZ,
> > +	.init_fn = git_hash_sha1_init_unsafe,
> > +	.clone_fn = git_hash_sha1_clone_unsafe,
> > +	.update_fn = git_hash_sha1_update_unsafe,
> > +	.final_fn = git_hash_sha1_final_unsafe,
> > +	.final_oid_fn = git_hash_sha1_final_oid_unsafe,
> > +	.empty_tree = &empty_tree_oid,
> > +	.empty_blob = &empty_blob_oid,
> > +	.null_oid = &null_oid_sha1,
> > +};
> 
> All of the non-function fields here naturally must match the ones in the
> parent algo struct, or chaos ensues. That's a little fragile, but it's
> not like we're adding new algorithm variants a lot. The biggest risk, I
> guess, would be adding a new field to git_hash_algo which defaults to
> zero-initialization here. But again, there are only three total and they
> are defined near each other here, so I don't think it's a big risk
> overall.
> 
> I think this struct is a potential landmine for hash_algo_by_ptr():
> 
>   static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>   {
>           return p - hash_algos;
>   }
> 
> It's undefined behavior to pass in sha1_unsafe_algo to this function
> (but the compiler would not complain since the types are the same). I
> don't find it incredibly likely that somebody would want to do that on
> an unsafe variant, but I'm not thrilled about leaving that wart for
> somebody to find.
> 
> If we don't care about the speed of this function, then an
> implementation like:
> 
>   for (i = 0; i < GIT_HASH_NALGOS; i++) {
> 	if (p == &hash_algos[i] || p == hash_algos[i]->unsafe)
> 		return i;
>   }
>   return GIT_HASH_UNKNOWN;
> 
> would work. I'm not sure if that would be measurable. I was surprised at
> the number of places that hash_algo_by_ptr() is called. Many low-level
> oid functions need it because we store the integer id there rather than
> a direct pointer (so oidread(), oidclr(), oid_object_info_extended(),
> and so on). But I'd also expect the loop above to be pretty fast. So I
> dunno.

Yeah, I'm also a little nervous about this change with hash_algo_by_ptr.
I think we had discussed in the other thread about doing something like
this:

struct git_hash_algo_fns {
	/* The hash initialization function. */
	git_hash_init_fn init_fn;

	/* The hash context cloning function. */
	git_hash_clone_fn clone_fn;

	/* The hash update function. */
	git_hash_update_fn update_fn;

	/* The hash finalization function. */
	git_hash_final_fn final_fn;

	/* The hash finalization function for object IDs. */
	git_hash_final_oid_fn final_oid_fn;
};

and then doing this:

struct git_hash_algo {
	[...]

  struct git_hash_algo_fns fns;
  struct git_hash_algo_fns unsafe_fns;

  [...]
};

That would mean that we'd have to deal with two pointers in your first
patch, but I don't really think that's too terrible.  And, since this
approach doesn't result in an extra struct git_hash_algo, it's
impossible to misuse hash_algo_by_ptr.

I'm not a hard no on the current approach, but I think the above would
probably be a little safer and harder to misuse.  It would require extra
patches to convert existing callers, though, but they aren't hugely
numerous, and the conversion could probably be mostly automated.
-- 
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()`
  2024-11-22  0:39     ` brian m. carlson
@ 2024-11-22  8:25       ` Jeff King
  2024-11-22 20:37         ` brian m. carlson
  0 siblings, 1 reply; 60+ messages in thread
From: Jeff King @ 2024-11-22  8:25 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Taylor Blau, git, Elijah Newren, Junio C Hamano

On Fri, Nov 22, 2024 at 12:39:30AM +0000, brian m. carlson wrote:

> Yeah, I'm also a little nervous about this change with hash_algo_by_ptr.
> I think we had discussed in the other thread about doing something like
> this:
> 
> struct git_hash_algo_fns {
> [...]
>
> I'm not a hard no on the current approach, but I think the above would
> probably be a little safer and harder to misuse.  It would require extra
> patches to convert existing callers, though, but they aren't hugely
> numerous, and the conversion could probably be mostly automated.

Yeah, I was worried about the fallout to callers. But it might be worth
seeing how extensive it is.

...

OK, here's a quick and dirty conversion that applies on top of Taylor's
patches 1 and 2 (and then patches 3 and on would need to adjust to the
new world order).

There are quite a lot of spots that needed to be adjusted, but the most
part it was just s/->\([a-z]\)_fn/->fn.\1/.  (I'll leave adjusting the
unsafe_ variants as an exercise for the reader).

The naming convention was just what I made up, but one nice effect is
that it replaces "_fn" with "fn.", so the line lengths remain the same.
:) It does mean there's a function called "final", which is a keyword in
C++. I don't know if that matters to us or not (I feel like we were
trying to avoid those, but it appears as a function elsewhere, too).

So I dunno. It is a one-time pain, but I think the result harmonizes the
type system with the concept better.

---
 builtin/fast-import.c      |  18 +++---
 builtin/index-pack.c       |  18 +++---
 builtin/patch-id.c         |  10 ++--
 builtin/receive-pack.c     |  22 +++----
 builtin/unpack-objects.c   |  12 ++--
 bulk-checkin.c             |  10 ++--
 csum-file.c                |  18 +++---
 diff.c                     |  28 ++++-----
 hash.h                     |  37 ++++--------
 http-push.c                |   6 +-
 http.c                     |   8 +--
 object-file.c              | 112 +++++++++++++++++++----------------
 pack-check.c               |   6 +-
 pack-write.c               |  20 +++----
 read-cache.c               |  20 +++----
 rerere.c                   |   8 +--
 t/helper/test-hash-speed.c |   6 +-
 t/helper/test-hash.c       |   6 +-
 t/unit-tests/t-hash.c      |   6 +-
 trace2/tr2_sid.c           |   6 +-
 20 files changed, 189 insertions(+), 188 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 76d5c20f14..b7870d1f1e 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -953,10 +953,10 @@ static int store_object(
 
 	hdrlen = format_object_header((char *)hdr, sizeof(hdr), type,
 				      dat->len);
-	the_hash_algo->init_fn(&c);
-	the_hash_algo->update_fn(&c, hdr, hdrlen);
-	the_hash_algo->update_fn(&c, dat->buf, dat->len);
-	the_hash_algo->final_oid_fn(&oid, &c);
+	the_hash_algo->fn.init(&c);
+	the_hash_algo->fn.update(&c, hdr, hdrlen);
+	the_hash_algo->fn.update(&c, dat->buf, dat->len);
+	the_hash_algo->fn.final_oid(&oid, &c);
 	if (oidout)
 		oidcpy(oidout, &oid);
 
@@ -1101,14 +1101,14 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 		|| (pack_size + PACK_SIZE_THRESHOLD + len) < pack_size)
 		cycle_packfile();
 
-	the_hash_algo->init_fn(&checkpoint.ctx);
+	the_hash_algo->fn.init(&checkpoint.ctx);
 	hashfile_checkpoint(pack_file, &checkpoint);
 	offset = checkpoint.offset;
 
 	hdrlen = format_object_header((char *)out_buf, out_sz, OBJ_BLOB, len);
 
-	the_hash_algo->init_fn(&c);
-	the_hash_algo->update_fn(&c, out_buf, hdrlen);
+	the_hash_algo->fn.init(&c);
+	the_hash_algo->fn.update(&c, out_buf, hdrlen);
 
 	crc32_begin(pack_file);
 
@@ -1126,7 +1126,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 			if (!n && feof(stdin))
 				die("EOF in data (%" PRIuMAX " bytes remaining)", len);
 
-			the_hash_algo->update_fn(&c, in_buf, n);
+			the_hash_algo->fn.update(&c, in_buf, n);
 			s.next_in = in_buf;
 			s.avail_in = n;
 			len -= n;
@@ -1152,7 +1152,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 		}
 	}
 	git_deflate_end(&s);
-	the_hash_algo->final_oid_fn(&oid, &c);
+	the_hash_algo->fn.final_oid(&oid, &c);
 
 	if (oidout)
 		oidcpy(oidout, &oid);
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 08b340552f..c5b5573edf 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -298,7 +298,7 @@ static void flush(void)
 	if (input_offset) {
 		if (output_fd >= 0)
 			write_or_die(output_fd, input_buffer, input_offset);
-		the_hash_algo->update_fn(&input_ctx, input_buffer, input_offset);
+		the_hash_algo->fn.update(&input_ctx, input_buffer, input_offset);
 		memmove(input_buffer, input_buffer + input_offset, input_len);
 		input_offset = 0;
 	}
@@ -371,7 +371,7 @@ static const char *open_pack_file(const char *pack_name)
 		output_fd = -1;
 		nothread_data.pack_fd = input_fd;
 	}
-	the_hash_algo->init_fn(&input_ctx);
+	the_hash_algo->fn.init(&input_ctx);
 	return pack_name;
 }
 
@@ -476,8 +476,8 @@ static void *unpack_entry_data(off_t offset, unsigned long size,
 
 	if (!is_delta_type(type)) {
 		hdrlen = format_object_header(hdr, sizeof(hdr), type, size);
-		the_hash_algo->init_fn(&c);
-		the_hash_algo->update_fn(&c, hdr, hdrlen);
+		the_hash_algo->fn.init(&c);
+		the_hash_algo->fn.update(&c, hdr, hdrlen);
 	} else
 		oid = NULL;
 	if (type == OBJ_BLOB && size > big_file_threshold)
@@ -497,7 +497,7 @@ static void *unpack_entry_data(off_t offset, unsigned long size,
 		status = git_inflate(&stream, 0);
 		use(input_len - stream.avail_in);
 		if (oid)
-			the_hash_algo->update_fn(&c, last_out, stream.next_out - last_out);
+			the_hash_algo->fn.update(&c, last_out, stream.next_out - last_out);
 		if (buf == fixed_buf) {
 			stream.next_out = buf;
 			stream.avail_out = sizeof(fixed_buf);
@@ -507,7 +507,7 @@ static void *unpack_entry_data(off_t offset, unsigned long size,
 		bad_object(offset, _("inflate returned %d"), status);
 	git_inflate_end(&stream);
 	if (oid)
-		the_hash_algo->final_oid_fn(oid, &c);
+		the_hash_algo->fn.final_oid(oid, &c);
 	return buf == fixed_buf ? NULL : buf;
 }
 
@@ -1256,9 +1256,9 @@ static void parse_pack_objects(unsigned char *hash)
 
 	/* Check pack integrity */
 	flush();
-	the_hash_algo->init_fn(&tmp_ctx);
-	the_hash_algo->clone_fn(&tmp_ctx, &input_ctx);
-	the_hash_algo->final_fn(hash, &tmp_ctx);
+	the_hash_algo->fn.init(&tmp_ctx);
+	the_hash_algo->fn.clone(&tmp_ctx, &input_ctx);
+	the_hash_algo->fn.final(hash, &tmp_ctx);
 	if (!hasheq(fill(the_hash_algo->rawsz), hash, the_repository->hash_algo))
 		die(_("pack is corrupted (SHA1 mismatch)"));
 	use(the_hash_algo->rawsz);
diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 93b398e391..3bf1367341 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -70,7 +70,7 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 	char pre_oid_str[GIT_MAX_HEXSZ + 1], post_oid_str[GIT_MAX_HEXSZ + 1];
 	git_hash_ctx ctx;
 
-	the_hash_algo->init_fn(&ctx);
+	the_hash_algo->fn.init(&ctx);
 	oidclr(result, the_repository->hash_algo);
 
 	while (strbuf_getwholeline(line_buf, stdin, '\n') != EOF) {
@@ -83,7 +83,7 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 		    !skip_prefix(line, "From ", &p) &&
 		    starts_with(line, "\\ ") && 12 < strlen(line)) {
 			if (verbatim)
-				the_hash_algo->update_fn(&ctx, line, strlen(line));
+				the_hash_algo->fn.update(&ctx, line, strlen(line));
 			continue;
 		}
 
@@ -102,9 +102,9 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 			    starts_with(line, "Binary files")) {
 				diff_is_binary = 1;
 				before = 0;
-				the_hash_algo->update_fn(&ctx, pre_oid_str,
+				the_hash_algo->fn.update(&ctx, pre_oid_str,
 							 strlen(pre_oid_str));
-				the_hash_algo->update_fn(&ctx, post_oid_str,
+				the_hash_algo->fn.update(&ctx, post_oid_str,
 							 strlen(post_oid_str));
 				if (stable)
 					flush_one_hunk(result, &ctx);
@@ -163,7 +163,7 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 		/* Add line to hash algo (possibly removing whitespace) */
 		len = verbatim ? strlen(line) : remove_space(line);
 		patchlen += len;
-		the_hash_algo->update_fn(&ctx, line, len);
+		the_hash_algo->fn.update(&ctx, line, len);
 	}
 
 	if (!found_next)
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index ab5b20e39c..b7aa6cc556 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -569,9 +569,9 @@ static void hmac_hash(unsigned char *out,
 	/* RFC 2104 2. (1) */
 	memset(key, '\0', GIT_MAX_BLKSZ);
 	if (the_hash_algo->blksz < key_len) {
-		the_hash_algo->init_fn(&ctx);
-		the_hash_algo->update_fn(&ctx, key_in, key_len);
-		the_hash_algo->final_fn(key, &ctx);
+		the_hash_algo->fn.init(&ctx);
+		the_hash_algo->fn.update(&ctx, key_in, key_len);
+		the_hash_algo->fn.final(key, &ctx);
 	} else {
 		memcpy(key, key_in, key_len);
 	}
@@ -583,16 +583,16 @@ static void hmac_hash(unsigned char *out,
 	}
 
 	/* RFC 2104 2. (3) & (4) */
-	the_hash_algo->init_fn(&ctx);
-	the_hash_algo->update_fn(&ctx, k_ipad, sizeof(k_ipad));
-	the_hash_algo->update_fn(&ctx, text, text_len);
-	the_hash_algo->final_fn(out, &ctx);
+	the_hash_algo->fn.init(&ctx);
+	the_hash_algo->fn.update(&ctx, k_ipad, sizeof(k_ipad));
+	the_hash_algo->fn.update(&ctx, text, text_len);
+	the_hash_algo->fn.final(out, &ctx);
 
 	/* RFC 2104 2. (6) & (7) */
-	the_hash_algo->init_fn(&ctx);
-	the_hash_algo->update_fn(&ctx, k_opad, sizeof(k_opad));
-	the_hash_algo->update_fn(&ctx, out, the_hash_algo->rawsz);
-	the_hash_algo->final_fn(out, &ctx);
+	the_hash_algo->fn.init(&ctx);
+	the_hash_algo->fn.update(&ctx, k_opad, sizeof(k_opad));
+	the_hash_algo->fn.update(&ctx, out, the_hash_algo->rawsz);
+	the_hash_algo->fn.final(out, &ctx);
 }
 
 static char *prepare_push_cert_nonce(const char *path, timestamp_t stamp)
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 02b8d02f63..c263343531 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -67,7 +67,7 @@ static void *fill(int min)
 	if (min > sizeof(buffer))
 		die("cannot fill %d bytes", min);
 	if (offset) {
-		the_hash_algo->update_fn(&ctx, buffer, offset);
+		the_hash_algo->fn.update(&ctx, buffer, offset);
 		memmove(buffer, buffer + offset, len);
 		offset = 0;
 	}
@@ -667,12 +667,12 @@ int cmd_unpack_objects(int argc,
 		/* We don't take any non-flag arguments now.. Maybe some day */
 		usage(unpack_usage);
 	}
-	the_hash_algo->init_fn(&ctx);
+	the_hash_algo->fn.init(&ctx);
 	unpack_all();
-	the_hash_algo->update_fn(&ctx, buffer, offset);
-	the_hash_algo->init_fn(&tmp_ctx);
-	the_hash_algo->clone_fn(&tmp_ctx, &ctx);
-	the_hash_algo->final_oid_fn(&oid, &tmp_ctx);
+	the_hash_algo->fn.update(&ctx, buffer, offset);
+	the_hash_algo->fn.init(&tmp_ctx);
+	the_hash_algo->fn.clone(&tmp_ctx, &ctx);
+	the_hash_algo->fn.final_oid(&oid, &tmp_ctx);
 	if (strict) {
 		write_rest();
 		if (fsck_finish(&fsck_options))
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 2753d5bbe4..7fc545703a 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -193,7 +193,7 @@ static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 				if (rsize < hsize)
 					hsize = rsize;
 				if (hsize)
-					the_hash_algo->update_fn(ctx, ibuf, hsize);
+					the_hash_algo->fn.update(ctx, ibuf, hsize);
 				*already_hashed_to = offset;
 			}
 			s.next_in = ibuf;
@@ -269,9 +269,9 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 
 	header_len = format_object_header((char *)obuf, sizeof(obuf),
 					  OBJ_BLOB, size);
-	the_hash_algo->init_fn(&ctx);
-	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->init_fn(&checkpoint.ctx);
+	the_hash_algo->fn.init(&ctx);
+	the_hash_algo->fn.update(&ctx, obuf, header_len);
+	the_hash_algo->fn.init(&checkpoint.ctx);
 
 	/* Note: idx is non-NULL when we are writing */
 	if ((flags & HASH_WRITE_OBJECT) != 0)
@@ -302,7 +302,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
 			return error("cannot seek back");
 	}
-	the_hash_algo->final_oid_fn(result_oid, &ctx);
+	the_hash_algo->fn.final_oid(result_oid, &ctx);
 	if (!idx)
 		return 0;
 
diff --git a/csum-file.c b/csum-file.c
index 107bc4c96f..05e20de441 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -50,7 +50,7 @@ void hashflush(struct hashfile *f)
 
 	if (offset) {
 		if (!f->skip_hash)
-			f->algop->unsafe_update_fn(&f->ctx, f->buffer, offset);
+			f->algop->unsafe_fn.update(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -73,7 +73,7 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	if (f->skip_hash)
 		hashclr(f->buffer, f->algop);
 	else
-		f->algop->unsafe_final_fn(f->buffer, &f->ctx);
+		f->algop->unsafe_fn.final(f->buffer, &f->ctx);
 
 	if (result)
 		hashcpy(result, f->buffer, f->algop);
@@ -128,7 +128,7 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * f->offset is necessarily zero.
 			 */
 			if (!f->skip_hash)
-				f->algop->unsafe_update_fn(&f->ctx, buf, nr);
+				f->algop->unsafe_fn.update(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -176,7 +176,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->skip_hash = 0;
 
 	f->algop = the_hash_algo;
-	f->algop->unsafe_init_fn(&f->ctx);
+	f->algop->unsafe_fn.init(&f->ctx);
 
 	f->buffer_len = buffer_len;
 	f->buffer = xmalloc(buffer_len);
@@ -210,7 +210,7 @@ void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpo
 {
 	hashflush(f);
 	checkpoint->offset = f->total;
-	f->algop->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
+	f->algop->unsafe_fn.clone(&checkpoint->ctx, &f->ctx);
 }
 
 int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
@@ -221,7 +221,7 @@ int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint
 	    lseek(f->fd, offset, SEEK_SET) != offset)
 		return -1;
 	f->total = offset;
-	f->algop->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
+	f->algop->unsafe_fn.clone(&f->ctx, &checkpoint->ctx);
 	f->offset = 0; /* hashflush() was called in checkpoint */
 	return 0;
 }
@@ -248,9 +248,9 @@ int hashfile_checksum_valid(const unsigned char *data, size_t total_len)
 	if (total_len < algop->rawsz)
 		return 0; /* say "too short"? */
 
-	algop->unsafe_init_fn(&ctx);
-	algop->unsafe_update_fn(&ctx, data, data_len);
-	algop->unsafe_final_fn(got, &ctx);
+	algop->unsafe_fn.init(&ctx);
+	algop->unsafe_fn.update(&ctx, data, data_len);
+	algop->unsafe_fn.final(got, &ctx);
 
 	return hasheq(got, data + data_len, algop);
 }
diff --git a/diff.c b/diff.c
index dceac20d18..85d776fa37 100644
--- a/diff.c
+++ b/diff.c
@@ -6413,8 +6413,8 @@ void flush_one_hunk(struct object_id *result, git_hash_ctx *ctx)
 	unsigned short carry = 0;
 	int i;
 
-	the_hash_algo->final_fn(hash, ctx);
-	the_hash_algo->init_fn(ctx);
+	the_hash_algo->fn.final(hash, ctx);
+	the_hash_algo->fn.init(ctx);
 	/* 20-byte sum, with carry */
 	for (i = 0; i < the_hash_algo->rawsz; ++i) {
 		carry += result->hash[i] + hash[i];
@@ -6432,22 +6432,22 @@ static int patch_id_consume(void *priv, char *line, unsigned long len)
 		return 0;
 	new_len = remove_space(line, len);
 
-	the_hash_algo->update_fn(data->ctx, line, new_len);
+	the_hash_algo->fn.update(data->ctx, line, new_len);
 	data->patchlen += new_len;
 	return 0;
 }
 
 static void patch_id_add_string(git_hash_ctx *ctx, const char *str)
 {
-	the_hash_algo->update_fn(ctx, str, strlen(str));
+	the_hash_algo->fn.update(ctx, str, strlen(str));
 }
 
 static void patch_id_add_mode(git_hash_ctx *ctx, unsigned mode)
 {
 	/* large enough for 2^32 in octal */
 	char buf[12];
 	int len = xsnprintf(buf, sizeof(buf), "%06o", mode);
-	the_hash_algo->update_fn(ctx, buf, len);
+	the_hash_algo->fn.update(ctx, buf, len);
 }
 
 /* returns 0 upon success, and writes result into oid */
@@ -6458,7 +6458,7 @@ static int diff_get_patch_id(struct diff_options *options, struct object_id *oid
 	git_hash_ctx ctx;
 	struct patch_id_t data;
 
-	the_hash_algo->init_fn(&ctx);
+	the_hash_algo->fn.init(&ctx);
 	memset(&data, 0, sizeof(struct patch_id_t));
 	data.ctx = &ctx;
 	oidclr(oid, the_repository->hash_algo);
@@ -6491,9 +6491,9 @@ static int diff_get_patch_id(struct diff_options *options, struct object_id *oid
 		len2 = remove_space(p->two->path, strlen(p->two->path));
 		patch_id_add_string(&ctx, "diff--git");
 		patch_id_add_string(&ctx, "a/");
-		the_hash_algo->update_fn(&ctx, p->one->path, len1);
+		the_hash_algo->fn.update(&ctx, p->one->path, len1);
 		patch_id_add_string(&ctx, "b/");
-		the_hash_algo->update_fn(&ctx, p->two->path, len2);
+		the_hash_algo->fn.update(&ctx, p->two->path, len2);
 
 		if (p->one->mode == 0) {
 			patch_id_add_string(&ctx, "newfilemode");
@@ -6512,24 +6512,24 @@ static int diff_get_patch_id(struct diff_options *options, struct object_id *oid
 			/* don't do anything since we're only populating header info */
 		} else if (diff_filespec_is_binary(options->repo, p->one) ||
 		    diff_filespec_is_binary(options->repo, p->two)) {
-			the_hash_algo->update_fn(&ctx, oid_to_hex(&p->one->oid),
+			the_hash_algo->fn.update(&ctx, oid_to_hex(&p->one->oid),
 					the_hash_algo->hexsz);
-			the_hash_algo->update_fn(&ctx, oid_to_hex(&p->two->oid),
+			the_hash_algo->fn.update(&ctx, oid_to_hex(&p->two->oid),
 					the_hash_algo->hexsz);
 		} else {
 			if (p->one->mode == 0) {
 				patch_id_add_string(&ctx, "---/dev/null");
 				patch_id_add_string(&ctx, "+++b/");
-				the_hash_algo->update_fn(&ctx, p->two->path, len2);
+				the_hash_algo->fn.update(&ctx, p->two->path, len2);
 			} else if (p->two->mode == 0) {
 				patch_id_add_string(&ctx, "---a/");
-				the_hash_algo->update_fn(&ctx, p->one->path, len1);
+				the_hash_algo->fn.update(&ctx, p->one->path, len1);
 				patch_id_add_string(&ctx, "+++/dev/null");
 			} else {
 				patch_id_add_string(&ctx, "---a/");
-				the_hash_algo->update_fn(&ctx, p->one->path, len1);
+				the_hash_algo->fn.update(&ctx, p->one->path, len1);
 				patch_id_add_string(&ctx, "+++b/");
-				the_hash_algo->update_fn(&ctx, p->two->path, len2);
+				the_hash_algo->fn.update(&ctx, p->two->path, len2);
 			}
 
 			if (fill_mmfile(options->repo, &mf1, p->one) < 0 ||
diff --git a/hash.h b/hash.h
index 756166ce5e..6107d5b47a 100644
--- a/hash.h
+++ b/hash.h
@@ -267,35 +267,24 @@ struct git_hash_algo {
 	/* The block size of the hash. */
 	size_t blksz;
 
-	/* The hash initialization function. */
-	git_hash_init_fn init_fn;
+	struct git_hash_algo_fns {
+		/* The hash initialization function. */
+		git_hash_init_fn init;
 
-	/* The hash context cloning function. */
-	git_hash_clone_fn clone_fn;
+		/* The hash context cloning function. */
+		git_hash_clone_fn clone;
 
-	/* The hash update function. */
-	git_hash_update_fn update_fn;
+		/* The hash update function. */
+		git_hash_update_fn update;
 
-	/* The hash finalization function. */
-	git_hash_final_fn final_fn;
+		/* The hash finalization function. */
+		git_hash_final_fn final;
 
-	/* The hash finalization function for object IDs. */
-	git_hash_final_oid_fn final_oid_fn;
+		/* The hash finalization function for object IDs. */
+		git_hash_final_oid_fn final_oid;
+	} fn;
 
-	/* The non-cryptographic hash initialization function. */
-	git_hash_init_fn unsafe_init_fn;
-
-	/* The non-cryptographic hash context cloning function. */
-	git_hash_clone_fn unsafe_clone_fn;
-
-	/* The non-cryptographic hash update function. */
-	git_hash_update_fn unsafe_update_fn;
-
-	/* The non-cryptographic hash finalization function. */
-	git_hash_final_fn unsafe_final_fn;
-
-	/* The non-cryptographic hash finalization function. */
-	git_hash_final_oid_fn unsafe_final_oid_fn;
+	struct git_hash_algo_fns unsafe_fn;
 
 	/* The OID of the empty tree. */
 	const struct object_id *empty_tree;
diff --git a/http-push.c b/http-push.c
index 4d24e6b8d4..9659e777d1 100644
--- a/http-push.c
+++ b/http-push.c
@@ -773,9 +773,9 @@ static void handle_new_lock_ctx(struct xml_ctx *ctx, int tag_closed)
 		} else if (!strcmp(ctx->name, DAV_ACTIVELOCK_TOKEN)) {
 			lock->token = xstrdup(ctx->cdata);
 
-			the_hash_algo->init_fn(&hash_ctx);
-			the_hash_algo->update_fn(&hash_ctx, lock->token, strlen(lock->token));
-			the_hash_algo->final_fn(lock_token_hash, &hash_ctx);
+			the_hash_algo->fn.init(&hash_ctx);
+			the_hash_algo->fn.update(&hash_ctx, lock->token, strlen(lock->token));
+			the_hash_algo->fn.final(lock_token_hash, &hash_ctx);
 
 			lock->tmpfile_suffix[0] = '_';
 			memcpy(lock->tmpfile_suffix + 1, hash_to_hex(lock_token_hash), the_hash_algo->hexsz);
diff --git a/http.c b/http.c
index 58242b9d2d..b3f31b52e6 100644
--- a/http.c
+++ b/http.c
@@ -2654,7 +2654,7 @@ static size_t fwrite_sha1_file(char *ptr, size_t eltsize, size_t nmemb,
 		freq->stream.next_out = expn;
 		freq->stream.avail_out = sizeof(expn);
 		freq->zret = git_inflate(&freq->stream, Z_SYNC_FLUSH);
-		the_hash_algo->update_fn(&freq->c, expn,
+		the_hash_algo->fn.update(&freq->c, expn,
 					 sizeof(expn) - freq->stream.avail_out);
 	} while (freq->stream.avail_in && freq->zret == Z_OK);
 	return nmemb;
@@ -2713,7 +2713,7 @@ struct http_object_request *new_http_object_request(const char *base_url,
 
 	git_inflate_init(&freq->stream);
 
-	the_hash_algo->init_fn(&freq->c);
+	the_hash_algo->fn.init(&freq->c);
 
 	freq->url = get_remote_object_url(base_url, hex, 0);
 
@@ -2749,7 +2749,7 @@ struct http_object_request *new_http_object_request(const char *base_url,
 		git_inflate_end(&freq->stream);
 		memset(&freq->stream, 0, sizeof(freq->stream));
 		git_inflate_init(&freq->stream);
-		the_hash_algo->init_fn(&freq->c);
+		the_hash_algo->fn.init(&freq->c);
 		if (prev_posn>0) {
 			prev_posn = 0;
 			lseek(freq->localfile, 0, SEEK_SET);
@@ -2820,7 +2820,7 @@ int finish_http_object_request(struct http_object_request *freq)
 		return -1;
 	}
 
-	the_hash_algo->final_oid_fn(&freq->real_oid, &freq->c);
+	the_hash_algo->fn.final_oid(&freq->real_oid, &freq->c);
 	if (freq->zret != Z_STREAM_END) {
 		unlink_or_warn(freq->tmpfile.buf);
 		return -1;
diff --git a/object-file.c b/object-file.c
index b1a3463852..c8399745d8 100644
--- a/object-file.c
+++ b/object-file.c
@@ -211,16 +211,20 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.rawsz = 0,
 		.hexsz = 0,
 		.blksz = 0,
-		.init_fn = git_hash_unknown_init,
-		.clone_fn = git_hash_unknown_clone,
-		.update_fn = git_hash_unknown_update,
-		.final_fn = git_hash_unknown_final,
-		.final_oid_fn = git_hash_unknown_final_oid,
-		.unsafe_init_fn = git_hash_unknown_init,
-		.unsafe_clone_fn = git_hash_unknown_clone,
-		.unsafe_update_fn = git_hash_unknown_update,
-		.unsafe_final_fn = git_hash_unknown_final,
-		.unsafe_final_oid_fn = git_hash_unknown_final_oid,
+		.fn = {
+			.init = git_hash_unknown_init,
+			.clone = git_hash_unknown_clone,
+			.update = git_hash_unknown_update,
+			.final = git_hash_unknown_final,
+			.final_oid = git_hash_unknown_final_oid,
+		},
+		.unsafe_fn = {
+			.init = git_hash_unknown_init,
+			.clone = git_hash_unknown_clone,
+			.update = git_hash_unknown_update,
+			.final = git_hash_unknown_final,
+			.final_oid = git_hash_unknown_final_oid,
+		},
 		.empty_tree = NULL,
 		.empty_blob = NULL,
 		.null_oid = NULL,
@@ -231,16 +235,20 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.rawsz = GIT_SHA1_RAWSZ,
 		.hexsz = GIT_SHA1_HEXSZ,
 		.blksz = GIT_SHA1_BLKSZ,
-		.init_fn = git_hash_sha1_init,
-		.clone_fn = git_hash_sha1_clone,
-		.update_fn = git_hash_sha1_update,
-		.final_fn = git_hash_sha1_final,
-		.final_oid_fn = git_hash_sha1_final_oid,
-		.unsafe_init_fn = git_hash_sha1_init_unsafe,
-		.unsafe_clone_fn = git_hash_sha1_clone_unsafe,
-		.unsafe_update_fn = git_hash_sha1_update_unsafe,
-		.unsafe_final_fn = git_hash_sha1_final_unsafe,
-		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
+		.fn = {
+			.init = git_hash_sha1_init,
+			.clone = git_hash_sha1_clone,
+			.update = git_hash_sha1_update,
+			.final = git_hash_sha1_final,
+			.final_oid = git_hash_sha1_final_oid,
+		},
+		.unsafe_fn = {
+			.init = git_hash_sha1_init_unsafe,
+			.clone = git_hash_sha1_clone_unsafe,
+			.update = git_hash_sha1_update_unsafe,
+			.final = git_hash_sha1_final_unsafe,
+			.final_oid = git_hash_sha1_final_oid_unsafe,
+		},
 		.empty_tree = &empty_tree_oid,
 		.empty_blob = &empty_blob_oid,
 		.null_oid = &null_oid_sha1,
@@ -251,16 +259,20 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.rawsz = GIT_SHA256_RAWSZ,
 		.hexsz = GIT_SHA256_HEXSZ,
 		.blksz = GIT_SHA256_BLKSZ,
-		.init_fn = git_hash_sha256_init,
-		.clone_fn = git_hash_sha256_clone,
-		.update_fn = git_hash_sha256_update,
-		.final_fn = git_hash_sha256_final,
-		.final_oid_fn = git_hash_sha256_final_oid,
-		.unsafe_init_fn = git_hash_sha256_init,
-		.unsafe_clone_fn = git_hash_sha256_clone,
-		.unsafe_update_fn = git_hash_sha256_update,
-		.unsafe_final_fn = git_hash_sha256_final,
-		.unsafe_final_oid_fn = git_hash_sha256_final_oid,
+		.fn = {
+			.init = git_hash_sha256_init,
+			.clone = git_hash_sha256_clone,
+			.update = git_hash_sha256_update,
+			.final = git_hash_sha256_final,
+			.final_oid = git_hash_sha256_final_oid,
+		},
+		.unsafe_fn = {
+			.init = git_hash_sha256_init,
+			.clone = git_hash_sha256_clone,
+			.update = git_hash_sha256_update,
+			.final = git_hash_sha256_final,
+			.final_oid = git_hash_sha256_final_oid,
+		},
 		.empty_tree = &empty_tree_oid_sha256,
 		.empty_blob = &empty_blob_oid_sha256,
 		.null_oid = &null_oid_sha256,
@@ -1185,8 +1197,8 @@ int stream_object_signature(struct repository *r, const struct object_id *oid)
 	hdrlen = format_object_header(hdr, sizeof(hdr), obj_type, size);
 
 	/* Sha1.. */
-	r->hash_algo->init_fn(&c);
-	r->hash_algo->update_fn(&c, hdr, hdrlen);
+	r->hash_algo->fn.init(&c);
+	r->hash_algo->fn.update(&c, hdr, hdrlen);
 	for (;;) {
 		char buf[1024 * 16];
 		ssize_t readlen = read_istream(st, buf, sizeof(buf));
@@ -1197,9 +1209,9 @@ int stream_object_signature(struct repository *r, const struct object_id *oid)
 		}
 		if (!readlen)
 			break;
-		r->hash_algo->update_fn(&c, buf, readlen);
+		r->hash_algo->fn.update(&c, buf, readlen);
 	}
-	r->hash_algo->final_oid_fn(&real_oid, &c);
+	r->hash_algo->fn.final_oid(&real_oid, &c);
 	close_istream(st);
 	return !oideq(oid, &real_oid) ? -1 : 0;
 }
@@ -1943,10 +1955,10 @@ static void hash_object_body(const struct git_hash_algo *algo, git_hash_ctx *c,
 			     struct object_id *oid,
 			     char *hdr, int *hdrlen)
 {
-	algo->init_fn(c);
-	algo->update_fn(c, hdr, *hdrlen);
-	algo->update_fn(c, buf, len);
-	algo->final_oid_fn(oid, c);
+	algo->fn.init(c);
+	algo->fn.update(c, hdr, *hdrlen);
+	algo->fn.update(c, buf, len);
+	algo->fn.final_oid(oid, c);
 }
 
 static void write_object_file_prepare(const struct git_hash_algo *algo,
@@ -2206,18 +2218,18 @@ static int start_loose_object_common(struct strbuf *tmp_file,
 	git_deflate_init(stream, zlib_compression_level);
 	stream->next_out = buf;
 	stream->avail_out = buflen;
-	algo->init_fn(c);
+	algo->fn.init(c);
 	if (compat && compat_c)
-		compat->init_fn(compat_c);
+		compat->fn.init(compat_c);
 
 	/*  Start to feed header to zlib stream */
 	stream->next_in = (unsigned char *)hdr;
 	stream->avail_in = hdrlen;
 	while (git_deflate(stream, 0) == Z_OK)
 		; /* nothing */
-	algo->update_fn(c, hdr, hdrlen);
+	algo->fn.update(c, hdr, hdrlen);
 	if (compat && compat_c)
-		compat->update_fn(compat_c, hdr, hdrlen);
+		compat->fn.update(compat_c, hdr, hdrlen);
 
 	return fd;
 }
@@ -2238,9 +2250,9 @@ static int write_loose_object_common(git_hash_ctx *c, git_hash_ctx *compat_c,
 	int ret;
 
 	ret = git_deflate(stream, flush ? Z_FINISH : 0);
-	algo->update_fn(c, in0, stream->next_in - in0);
+	algo->fn.update(c, in0, stream->next_in - in0);
 	if (compat && compat_c)
-		compat->update_fn(compat_c, in0, stream->next_in - in0);
+		compat->fn.update(compat_c, in0, stream->next_in - in0);
 	if (write_in_full(fd, compressed, stream->next_out - compressed) < 0)
 		die_errno(_("unable to write loose object file"));
 	stream->next_out = compressed;
@@ -2267,9 +2279,9 @@ static int end_loose_object_common(git_hash_ctx *c, git_hash_ctx *compat_c,
 	ret = git_deflate_end_gently(stream);
 	if (ret != Z_OK)
 		return ret;
-	algo->final_oid_fn(oid, c);
+	algo->fn.final_oid(oid, c);
 	if (compat && compat_c)
-		compat->final_oid_fn(compat_oid, compat_c);
+		compat->fn.final_oid(compat_oid, compat_c);
 
 	return Z_OK;
 }
@@ -3027,8 +3039,8 @@ static int check_stream_oid(git_zstream *stream,
 	unsigned long total_read;
 	int status = Z_OK;
 
-	the_hash_algo->init_fn(&c);
-	the_hash_algo->update_fn(&c, hdr, stream->total_out);
+	the_hash_algo->fn.init(&c);
+	the_hash_algo->fn.update(&c, hdr, stream->total_out);
 
 	/*
 	 * We already read some bytes into hdr, but the ones up to the NUL
@@ -3048,7 +3060,7 @@ static int check_stream_oid(git_zstream *stream,
 		if (size - total_read < stream->avail_out)
 			stream->avail_out = size - total_read;
 		status = git_inflate(stream, Z_FINISH);
-		the_hash_algo->update_fn(&c, buf, stream->next_out - buf);
+		the_hash_algo->fn.update(&c, buf, stream->next_out - buf);
 		total_read += stream->next_out - buf;
 	}
 	git_inflate_end(stream);
@@ -3063,7 +3075,7 @@ static int check_stream_oid(git_zstream *stream,
 		return -1;
 	}
 
-	the_hash_algo->final_oid_fn(&real_oid, &c);
+	the_hash_algo->fn.final_oid(&real_oid, &c);
 	if (!oideq(expected_oid, &real_oid)) {
 		error(_("hash mismatch for %s (expected %s)"), path,
 		      oid_to_hex(expected_oid));
diff --git a/pack-check.c b/pack-check.c
index e883dae3f2..2c1c2c54fa 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -67,7 +67,7 @@ static int verify_packfile(struct repository *r,
 	if (!is_pack_valid(p))
 		return error("packfile %s cannot be accessed", p->pack_name);
 
-	r->hash_algo->init_fn(&ctx);
+	r->hash_algo->fn.init(&ctx);
 	do {
 		unsigned long remaining;
 		unsigned char *in = use_pack(p, w_curs, offset, &remaining);
@@ -76,9 +76,9 @@ static int verify_packfile(struct repository *r,
 			pack_sig_ofs = p->pack_size - r->hash_algo->rawsz;
 		if (offset > pack_sig_ofs)
 			remaining -= (unsigned int)(offset - pack_sig_ofs);
-		r->hash_algo->update_fn(&ctx, in, remaining);
+		r->hash_algo->fn.update(&ctx, in, remaining);
 	} while (offset < pack_sig_ofs);
-	r->hash_algo->final_fn(hash, &ctx);
+	r->hash_algo->fn.final(hash, &ctx);
 	pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
 	if (!hasheq(hash, pack_sig, the_repository->hash_algo))
 		err = error("%s pack checksum mismatch",
diff --git a/pack-write.c b/pack-write.c
index 8c7dfddc5a..d202c6c9a4 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -392,8 +392,8 @@ void fixup_pack_header_footer(int pack_fd,
 	char *buf;
 	ssize_t read_result;
 
-	the_hash_algo->init_fn(&old_hash_ctx);
-	the_hash_algo->init_fn(&new_hash_ctx);
+	the_hash_algo->fn.init(&old_hash_ctx);
+	the_hash_algo->fn.init(&new_hash_ctx);
 
 	if (lseek(pack_fd, 0, SEEK_SET) != 0)
 		die_errno("Failed seeking to start of '%s'", pack_name);
@@ -405,9 +405,9 @@ void fixup_pack_header_footer(int pack_fd,
 			  pack_name);
 	if (lseek(pack_fd, 0, SEEK_SET) != 0)
 		die_errno("Failed seeking to start of '%s'", pack_name);
-	the_hash_algo->update_fn(&old_hash_ctx, &hdr, sizeof(hdr));
+	the_hash_algo->fn.update(&old_hash_ctx, &hdr, sizeof(hdr));
 	hdr.hdr_entries = htonl(object_count);
-	the_hash_algo->update_fn(&new_hash_ctx, &hdr, sizeof(hdr));
+	the_hash_algo->fn.update(&new_hash_ctx, &hdr, sizeof(hdr));
 	write_or_die(pack_fd, &hdr, sizeof(hdr));
 	partial_pack_offset -= sizeof(hdr);
 
@@ -422,7 +422,7 @@ void fixup_pack_header_footer(int pack_fd,
 			break;
 		if (n < 0)
 			die_errno("Failed to checksum '%s'", pack_name);
-		the_hash_algo->update_fn(&new_hash_ctx, buf, n);
+		the_hash_algo->fn.update(&new_hash_ctx, buf, n);
 
 		aligned_sz -= n;
 		if (!aligned_sz)
@@ -431,11 +431,11 @@ void fixup_pack_header_footer(int pack_fd,
 		if (!partial_pack_hash)
 			continue;
 
-		the_hash_algo->update_fn(&old_hash_ctx, buf, n);
+		the_hash_algo->fn.update(&old_hash_ctx, buf, n);
 		partial_pack_offset -= n;
 		if (partial_pack_offset == 0) {
 			unsigned char hash[GIT_MAX_RAWSZ];
-			the_hash_algo->final_fn(hash, &old_hash_ctx);
+			the_hash_algo->fn.final(hash, &old_hash_ctx);
 			if (!hasheq(hash, partial_pack_hash,
 				    the_repository->hash_algo))
 				die("Unexpected checksum for %s "
@@ -445,16 +445,16 @@ void fixup_pack_header_footer(int pack_fd,
 			 * pack, which also means making partial_pack_offset
 			 * big enough not to matter anymore.
 			 */
-			the_hash_algo->init_fn(&old_hash_ctx);
+			the_hash_algo->fn.init(&old_hash_ctx);
 			partial_pack_offset = ~partial_pack_offset;
 			partial_pack_offset -= MSB(partial_pack_offset, 1);
 		}
 	}
 	free(buf);
 
 	if (partial_pack_hash)
-		the_hash_algo->final_fn(partial_pack_hash, &old_hash_ctx);
-	the_hash_algo->final_fn(new_pack_hash, &new_hash_ctx);
+		the_hash_algo->fn.final(partial_pack_hash, &old_hash_ctx);
+	the_hash_algo->fn.final(new_pack_hash, &new_hash_ctx);
 	write_or_die(pack_fd, new_pack_hash, the_hash_algo->rawsz);
 	fsync_component_or_die(FSYNC_COMPONENT_PACK, pack_fd, pack_name);
 }
diff --git a/read-cache.c b/read-cache.c
index 01d0b3ad22..5170b07a33 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1736,9 +1736,9 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	if (oideq(&oid, null_oid()))
 		return 0;
 
-	the_hash_algo->init_fn(&c);
-	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
-	the_hash_algo->final_fn(hash, &c);
+	the_hash_algo->fn.init(&c);
+	the_hash_algo->fn.update(&c, hdr, size - the_hash_algo->rawsz);
+	the_hash_algo->fn.final(hash, &c);
 	if (!hasheq(hash, start, the_repository->hash_algo))
 		return error(_("bad index file sha1 signature"));
 	return 0;
@@ -2574,8 +2574,8 @@ static int write_index_ext_header(struct hashfile *f,
 	if (eoie_f) {
 		ext = htonl(ext);
 		sz = htonl(sz);
-		the_hash_algo->update_fn(eoie_f, &ext, sizeof(ext));
-		the_hash_algo->update_fn(eoie_f, &sz, sizeof(sz));
+		the_hash_algo->fn.update(eoie_f, &ext, sizeof(ext));
+		the_hash_algo->fn.update(eoie_f, &sz, sizeof(sz));
 	}
 	return 0;
 }
@@ -2977,7 +2977,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	 */
 	if (offset && record_eoie()) {
 		CALLOC_ARRAY(eoie_c, 1);
-		the_hash_algo->init_fn(eoie_c);
+		the_hash_algo->fn.init(eoie_c);
 	}
 
 	/*
@@ -3616,7 +3616,7 @@ static size_t read_eoie_extension(const char *mmap, size_t mmap_size)
 	 *	 "REUC" + <binary representation of M>)
 	 */
 	src_offset = offset;
-	the_hash_algo->init_fn(&c);
+	the_hash_algo->fn.init(&c);
 	while (src_offset < mmap_size - the_hash_algo->rawsz - EOIE_SIZE_WITH_HEADER) {
 		/* After an array of active_nr index entries,
 		 * there can be arbitrary number of extended
@@ -3632,12 +3632,12 @@ static size_t read_eoie_extension(const char *mmap, size_t mmap_size)
 		if (src_offset + 8 + extsize < src_offset)
 			return 0;
 
-		the_hash_algo->update_fn(&c, mmap + src_offset, 8);
+		the_hash_algo->fn.update(&c, mmap + src_offset, 8);
 
 		src_offset += 8;
 		src_offset += extsize;
 	}
-	the_hash_algo->final_fn(hash, &c);
+	the_hash_algo->fn.final(hash, &c);
 	if (!hasheq(hash, (const unsigned char *)index, the_repository->hash_algo))
 		return 0;
 
@@ -3658,7 +3658,7 @@ static void write_eoie_extension(struct strbuf *sb, git_hash_ctx *eoie_context,
 	strbuf_add(sb, &buffer, sizeof(uint32_t));
 
 	/* hash */
-	the_hash_algo->final_fn(hash, eoie_context);
+	the_hash_algo->fn.final(hash, eoie_context);
 	strbuf_add(sb, hash, the_hash_algo->rawsz);
 }
 
diff --git a/rerere.c b/rerere.c
index d01e98bf65..ef3b58b5c7 100644
--- a/rerere.c
+++ b/rerere.c
@@ -395,10 +395,10 @@ static int handle_conflict(struct strbuf *out, struct rerere_io *io,
 			strbuf_addbuf(out, &two);
 			rerere_strbuf_putconflict(out, '>', marker_size);
 			if (ctx) {
-				the_hash_algo->update_fn(ctx, one.buf ?
+				the_hash_algo->fn.update(ctx, one.buf ?
 							 one.buf : "",
 							 one.len + 1);
-				the_hash_algo->update_fn(ctx, two.buf ?
+				the_hash_algo->fn.update(ctx, two.buf ?
 							 two.buf : "",
 							 two.len + 1);
 			}
@@ -435,7 +435,7 @@ static int handle_path(unsigned char *hash, struct rerere_io *io, int marker_siz
 	struct strbuf buf = STRBUF_INIT, out = STRBUF_INIT;
 	int has_conflicts = 0;
 	if (hash)
-		the_hash_algo->init_fn(&ctx);
+		the_hash_algo->fn.init(&ctx);
 
 	while (!io->getline(&buf, io)) {
 		if (is_cmarker(buf.buf, '<', marker_size)) {
@@ -452,7 +452,7 @@ static int handle_path(unsigned char *hash, struct rerere_io *io, int marker_siz
 	strbuf_release(&out);
 
 	if (hash)
-		the_hash_algo->final_fn(hash, &ctx);
+		the_hash_algo->fn.final(hash, &ctx);
 
 	return has_conflicts;
 }
diff --git a/t/helper/test-hash-speed.c b/t/helper/test-hash-speed.c
index 7de822af51..1fd8cdac79 100644
--- a/t/helper/test-hash-speed.c
+++ b/t/helper/test-hash-speed.c
@@ -5,9 +5,9 @@
 
 static inline void compute_hash(const struct git_hash_algo *algo, git_hash_ctx *ctx, uint8_t *final, const void *p, size_t len)
 {
-	algo->init_fn(ctx);
-	algo->update_fn(ctx, p, len);
-	algo->final_fn(final, ctx);
+	algo->fn.init(ctx);
+	algo->fn.update(ctx, p, len);
+	algo->fn.final(final, ctx);
 }
 
 int cmd__hash_speed(int ac, const char **av)
diff --git a/t/helper/test-hash.c b/t/helper/test-hash.c
index 45d829c908..3e740bc4fb 100644
--- a/t/helper/test-hash.c
+++ b/t/helper/test-hash.c
@@ -27,7 +27,7 @@ int cmd_hash_impl(int ac, const char **av, int algo)
 			die("OOPS");
 	}
 
-	algop->init_fn(&ctx);
+	algop->fn.init(&ctx);
 
 	while (1) {
 		ssize_t sz, this_sz;
@@ -46,9 +46,9 @@ int cmd_hash_impl(int ac, const char **av, int algo)
 		}
 		if (this_sz == 0)
 			break;
-		algop->update_fn(&ctx, buffer, this_sz);
+		algop->fn.update(&ctx, buffer, this_sz);
 	}
-	algop->final_fn(hash, &ctx);
+	algop->fn.final(hash, &ctx);
 
 	if (binary)
 		fwrite(hash, 1, algop->rawsz, stdout);
diff --git a/t/unit-tests/t-hash.c b/t/unit-tests/t-hash.c
index e62647019b..9c1c9bf70b 100644
--- a/t/unit-tests/t-hash.c
+++ b/t/unit-tests/t-hash.c
@@ -15,9 +15,9 @@ static void check_hash_data(const void *data, size_t data_length,
 		unsigned char hash[GIT_MAX_HEXSZ];
 		const struct git_hash_algo *algop = &hash_algos[i];
 
-		algop->init_fn(&ctx);
-		algop->update_fn(&ctx, data, data_length);
-		algop->final_fn(hash, &ctx);
+		algop->fn.init(&ctx);
+		algop->fn.update(&ctx, data, data_length);
+		algop->fn.final(hash, &ctx);
 
 		if (!check_str(hash_to_hex_algop(hash, algop), expected_hashes[i - 1]))
 			test_msg("result does not match with the expected for %s\n", hash_algos[i].name);
diff --git a/trace2/tr2_sid.c b/trace2/tr2_sid.c
index 09c4ef0d17..a7b17abe2b 100644
--- a/trace2/tr2_sid.c
+++ b/trace2/tr2_sid.c
@@ -45,9 +45,9 @@ static void tr2_sid_append_my_sid_component(void)
 	if (xgethostname(hostname, sizeof(hostname)))
 		strbuf_add(&tr2sid_buf, "Localhost", 9);
 	else {
-		algo->init_fn(&ctx);
-		algo->update_fn(&ctx, hostname, strlen(hostname));
-		algo->final_fn(hash, &ctx);
+		algo->fn.init(&ctx);
+		algo->fn.update(&ctx, hostname, strlen(hostname));
+		algo->fn.final(hash, &ctx);
 		hash_to_hex_algop_r(hex, hash, algo);
 		strbuf_addch(&tr2sid_buf, 'H');
 		strbuf_add(&tr2sid_buf, hex, 8);

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()`
  2024-11-22  8:25       ` Jeff King
@ 2024-11-22 20:37         ` brian m. carlson
  0 siblings, 0 replies; 60+ messages in thread
From: brian m. carlson @ 2024-11-22 20:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, Elijah Newren, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1827 bytes --]

On 2024-11-22 at 08:25:23, Jeff King wrote:
> OK, here's a quick and dirty conversion that applies on top of Taylor's
> patches 1 and 2 (and then patches 3 and on would need to adjust to the
> new world order).
> 
> There are quite a lot of spots that needed to be adjusted, but the most
> part it was just s/->\([a-z]\)_fn/->fn.\1/.  (I'll leave adjusting the
> unsafe_ variants as an exercise for the reader).

Yeah, I saw that there were about 50 callers of the hash functions,
which on one hand isn't too many to change, but on the other hand also
surprised me.  I don't remember there being that many callers when I did
the SHA-256 work, but it could be that I'm remembering poorly.

> The naming convention was just what I made up, but one nice effect is
> that it replaces "_fn" with "fn.", so the line lengths remain the same.
> :) It does mean there's a function called "final", which is a keyword in
> C++. I don't know if that matters to us or not (I feel like we were
> trying to avoid those, but it appears as a function elsewhere, too).

It doesn't.  I think we had one contributor trying to make the code
compile as C++ at one point, but it definitely doesn't now.  A quick
`git grep` show that we have lots of variables named `new` as well.  (In
addition, the implicit cast of `void *`, such as from `malloc`, to a
pointer of any type is not allowed without a cast in C++, so that would
need to be handled, too.)

> So I dunno. It is a one-time pain, but I think the result harmonizes the
> type system with the concept better.

Yeah, I agree.  It definitely looks like it could mostly be done with a
Perl or Ruby script, even if slightly inconvenient.  And if there is a
conflict, it should be easy to fix up.
-- 
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()`
  2024-11-21  9:37   ` Jeff King
  2024-11-22  0:39     ` brian m. carlson
@ 2025-01-10 21:38     ` Taylor Blau
  2025-01-11  2:45       ` Jeff King
  1 sibling, 1 reply; 60+ messages in thread
From: Taylor Blau @ 2025-01-10 21:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Elijah Newren, Junio C Hamano, brian m. carlson

On Thu, Nov 21, 2024 at 04:37:31AM -0500, Jeff King wrote:
> If we don't care about the speed of this function, then an
> implementation like:
>
>   for (i = 0; i < GIT_HASH_NALGOS; i++) {
> 	if (p == &hash_algos[i] || p == hash_algos[i]->unsafe)
> 		return i;
>   }
>   return GIT_HASH_UNKNOWN;
>
> would work. I'm not sure if that would be measurable. I was surprised at
> the number of places that hash_algo_by_ptr() is called. Many low-level
> oid functions need it because we store the integer id there rather than
> a direct pointer (so oidread(), oidclr(), oid_object_info_extended(),
> and so on). But I'd also expect the loop above to be pretty fast. So I
> dunno.

Concerns about the speed aside (I agree that the for loop is likely to
be very fast, and will probably get unrolled by modern compilers), this
looks good to me with one small tweak.

We can't use `hash_algos[i]->unsafe` directly it might be NULL, so the
function as above would change the behavior of hash_algo_by_ptr()
slightly when provided NULL (it would return SHA-256 instead of
UNKNOWN).

So I think you'd want to write the loop like:

    size_t i;

    for (i = 0; i < GIT_HASH_NALGOS) {
        const struct git_hash_algo *algop = &hash_algos[i];

        if (p == algop || (algop->unsafe && p == algop->unsafe))
            return i;
    }

    return GIT_HASH_UNKNOWN;

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()`
  2025-01-10 21:38     ` Taylor Blau
@ 2025-01-11  2:45       ` Jeff King
  0 siblings, 0 replies; 60+ messages in thread
From: Jeff King @ 2025-01-11  2:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Elijah Newren, Junio C Hamano, brian m. carlson

On Fri, Jan 10, 2025 at 04:38:56PM -0500, Taylor Blau wrote:

> On Thu, Nov 21, 2024 at 04:37:31AM -0500, Jeff King wrote:
> > If we don't care about the speed of this function, then an
> > implementation like:
> >
> >   for (i = 0; i < GIT_HASH_NALGOS; i++) {
> > 	if (p == &hash_algos[i] || p == hash_algos[i]->unsafe)
> > 		return i;
> >   }
> >   return GIT_HASH_UNKNOWN;
> >
> > would work. I'm not sure if that would be measurable. I was surprised at
> > the number of places that hash_algo_by_ptr() is called. Many low-level
> > oid functions need it because we store the integer id there rather than
> > a direct pointer (so oidread(), oidclr(), oid_object_info_extended(),
> > and so on). But I'd also expect the loop above to be pretty fast. So I
> > dunno.
> 
> Concerns about the speed aside (I agree that the for loop is likely to
> be very fast, and will probably get unrolled by modern compilers), this
> looks good to me with one small tweak.

It should definitely be unrolled. The big change against the existing
code is that it has branches. But again, I'm not sure how much the
performance matters here (I would have naively said not at all, but it
does get called in lots of low-level spots).

> We can't use `hash_algos[i]->unsafe` directly it might be NULL, so the
> function as above would change the behavior of hash_algo_by_ptr()
> slightly when provided NULL (it would return SHA-256 instead of
> UNKNOWN).
> 
> So I think you'd want to write the loop like:
> 
>     size_t i;
> 
>     for (i = 0; i < GIT_HASH_NALGOS) {
>         const struct git_hash_algo *algop = &hash_algos[i];
> 
>         if (p == algop || (algop->unsafe && p == algop->unsafe))
>             return i;
>     }
> 
>     return GIT_HASH_UNKNOWN;

Yes, that works. Or maybe even just:

  if (!algop)
	return GIT_HASH_UNKNOWN;

at the top to cover the special case.

-Peff

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 4/6] csum-file.c: use unsafe_hash_algo()
  2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                   ` (2 preceding siblings ...)
  2024-11-20 19:13 ` [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
@ 2024-11-20 19:13 ` Taylor Blau
  2024-11-20 19:13 ` [PATCH 5/6] t/helper/test-hash.c: " Taylor Blau
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2024-11-20 19:13 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Jeff King, Junio C Hamano, brian m. carlson

Instead of calling the unsafe_ hash function variants directly, make use
of the shared 'algop' pointer by initializing it to:

    f->algop = unsafe_hash_algo(the_hash_algo);

, thus making all calls use the unsafe variants directly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 107bc4c96f9..fd51f5e432e 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -50,7 +50,7 @@ void hashflush(struct hashfile *f)
 
 	if (offset) {
 		if (!f->skip_hash)
-			f->algop->unsafe_update_fn(&f->ctx, f->buffer, offset);
+			f->algop->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -73,7 +73,7 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	if (f->skip_hash)
 		hashclr(f->buffer, f->algop);
 	else
-		f->algop->unsafe_final_fn(f->buffer, &f->ctx);
+		f->algop->final_fn(f->buffer, &f->ctx);
 
 	if (result)
 		hashcpy(result, f->buffer, f->algop);
@@ -128,7 +128,7 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * f->offset is necessarily zero.
 			 */
 			if (!f->skip_hash)
-				f->algop->unsafe_update_fn(&f->ctx, buf, nr);
+				f->algop->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -175,8 +175,8 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->do_crc = 0;
 	f->skip_hash = 0;
 
-	f->algop = the_hash_algo;
-	f->algop->unsafe_init_fn(&f->ctx);
+	f->algop = unsafe_hash_algo(the_hash_algo);
+	f->algop->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
 	f->buffer = xmalloc(buffer_len);
@@ -210,7 +210,7 @@ void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpo
 {
 	hashflush(f);
 	checkpoint->offset = f->total;
-	f->algop->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
+	f->algop->clone_fn(&checkpoint->ctx, &f->ctx);
 }
 
 int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
@@ -221,7 +221,7 @@ int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint
 	    lseek(f->fd, offset, SEEK_SET) != offset)
 		return -1;
 	f->total = offset;
-	f->algop->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
+	f->algop->clone_fn(&f->ctx, &checkpoint->ctx);
 	f->offset = 0; /* hashflush() was called in checkpoint */
 	return 0;
 }
@@ -242,15 +242,15 @@ int hashfile_checksum_valid(const unsigned char *data, size_t total_len)
 {
 	unsigned char got[GIT_MAX_RAWSZ];
 	git_hash_ctx ctx;
-	const struct git_hash_algo *algop = the_hash_algo;
+	const struct git_hash_algo *algop = unsafe_hash_algo(the_hash_algo);
 	size_t data_len = total_len - algop->rawsz;
 
 	if (total_len < algop->rawsz)
 		return 0; /* say "too short"? */
 
-	algop->unsafe_init_fn(&ctx);
-	algop->unsafe_update_fn(&ctx, data, data_len);
-	algop->unsafe_final_fn(got, &ctx);
+	algop->init_fn(&ctx);
+	algop->update_fn(&ctx, data, data_len);
+	algop->final_fn(got, &ctx);
 
 	return hasheq(got, data + data_len, algop);
 }
-- 
2.47.0.237.gc601277f4c4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 5/6] t/helper/test-hash.c: use unsafe_hash_algo()
  2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                   ` (3 preceding siblings ...)
  2024-11-20 19:13 ` [PATCH 4/6] csum-file.c: use unsafe_hash_algo() Taylor Blau
@ 2024-11-20 19:13 ` Taylor Blau
  2024-11-20 19:13 ` [PATCH 6/6] hash.h: drop unsafe_ function variants Taylor Blau
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2024-11-20 19:13 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Jeff King, Junio C Hamano, brian m. carlson

Remove a series of conditionals within the shared cmd_hash_impl() helper
that powers the 'sha1' and 'sha1-unsafe' helpers.

Instead, replace them with a single conditional that transforms the
specified hash algorithm into its unsafe variant. Then all subsequent
calls can directly use whatever function it wants to call without having
to decide between the safe and unsafe variants.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hash.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/t/helper/test-hash.c b/t/helper/test-hash.c
index d0ee668df95..aa82638c621 100644
--- a/t/helper/test-hash.c
+++ b/t/helper/test-hash.c
@@ -9,6 +9,8 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 	int binary = 0;
 	char *buffer;
 	const struct git_hash_algo *algop = &hash_algos[algo];
+	if (unsafe)
+		algop = unsafe_hash_algo(algop);
 
 	if (ac == 2) {
 		if (!strcmp(av[1], "-b"))
@@ -27,10 +29,7 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 			die("OOPS");
 	}
 
-	if (unsafe)
-		algop->unsafe_init_fn(&ctx);
-	else
-		algop->init_fn(&ctx);
+	algop->init_fn(&ctx);
 
 	while (1) {
 		ssize_t sz, this_sz;
@@ -49,15 +48,9 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 		}
 		if (this_sz == 0)
 			break;
-		if (unsafe)
-			algop->unsafe_update_fn(&ctx, buffer, this_sz);
-		else
-			algop->update_fn(&ctx, buffer, this_sz);
+		algop->update_fn(&ctx, buffer, this_sz);
 	}
-	if (unsafe)
-		algop->unsafe_final_fn(hash, &ctx);
-	else
-		algop->final_fn(hash, &ctx);
+	algop->final_fn(hash, &ctx);
 
 	if (binary)
 		fwrite(hash, 1, algop->rawsz, stdout);
-- 
2.47.0.237.gc601277f4c4


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 6/6] hash.h: drop unsafe_ function variants
  2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                   ` (4 preceding siblings ...)
  2024-11-20 19:13 ` [PATCH 5/6] t/helper/test-hash.c: " Taylor Blau
@ 2024-11-20 19:13 ` Taylor Blau
  2024-11-21  9:41   ` Jeff King
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 60+ messages in thread
From: Taylor Blau @ 2024-11-20 19:13 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Jeff King, Junio C Hamano, brian m. carlson

Now that all callers have been converted from:

    the_hash_algo->unsafe_init_fn();

to

    unsafe_hash_algo(the_hash_algo)->unsafe_init_fn();

and similar, we can remove the scaffolding for the unsafe_ function
variants and force callers to use the new unsafe_hash_algo() mechanic
instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 hash.h        | 15 ---------------
 object-file.c | 15 ---------------
 2 files changed, 30 deletions(-)

diff --git a/hash.h b/hash.h
index 23cf6876e50..6dcbf6ab835 100644
--- a/hash.h
+++ b/hash.h
@@ -282,21 +282,6 @@ struct git_hash_algo {
 	/* The hash finalization function for object IDs. */
 	git_hash_final_oid_fn final_oid_fn;
 
-	/* The non-cryptographic hash initialization function. */
-	git_hash_init_fn unsafe_init_fn;
-
-	/* The non-cryptographic hash context cloning function. */
-	git_hash_clone_fn unsafe_clone_fn;
-
-	/* The non-cryptographic hash update function. */
-	git_hash_update_fn unsafe_update_fn;
-
-	/* The non-cryptographic hash finalization function. */
-	git_hash_final_fn unsafe_final_fn;
-
-	/* The non-cryptographic hash finalization function. */
-	git_hash_final_oid_fn unsafe_final_oid_fn;
-
 	/* The OID of the empty tree. */
 	const struct object_id *empty_tree;
 
diff --git a/object-file.c b/object-file.c
index fddcdbe9ba6..1040a5408f2 100644
--- a/object-file.c
+++ b/object-file.c
@@ -232,11 +232,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_unknown_update,
 		.final_fn = git_hash_unknown_final,
 		.final_oid_fn = git_hash_unknown_final_oid,
-		.unsafe_init_fn = git_hash_unknown_init,
-		.unsafe_clone_fn = git_hash_unknown_clone,
-		.unsafe_update_fn = git_hash_unknown_update,
-		.unsafe_final_fn = git_hash_unknown_final,
-		.unsafe_final_oid_fn = git_hash_unknown_final_oid,
 		.empty_tree = NULL,
 		.empty_blob = NULL,
 		.null_oid = NULL,
@@ -252,11 +247,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_sha1_update,
 		.final_fn = git_hash_sha1_final,
 		.final_oid_fn = git_hash_sha1_final_oid,
-		.unsafe_init_fn = git_hash_sha1_init_unsafe,
-		.unsafe_clone_fn = git_hash_sha1_clone_unsafe,
-		.unsafe_update_fn = git_hash_sha1_update_unsafe,
-		.unsafe_final_fn = git_hash_sha1_final_unsafe,
-		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
 		.unsafe = &sha1_unsafe_algo,
 		.empty_tree = &empty_tree_oid,
 		.empty_blob = &empty_blob_oid,
@@ -273,11 +263,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_sha256_update,
 		.final_fn = git_hash_sha256_final,
 		.final_oid_fn = git_hash_sha256_final_oid,
-		.unsafe_init_fn = git_hash_sha256_init,
-		.unsafe_clone_fn = git_hash_sha256_clone,
-		.unsafe_update_fn = git_hash_sha256_update,
-		.unsafe_final_fn = git_hash_sha256_final,
-		.unsafe_final_oid_fn = git_hash_sha256_final_oid,
 		.empty_tree = &empty_tree_oid_sha256,
 		.empty_blob = &empty_blob_oid_sha256,
 		.null_oid = &null_oid_sha256,
-- 
2.47.0.237.gc601277f4c4

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 6/6] hash.h: drop unsafe_ function variants
  2024-11-20 19:13 ` [PATCH 6/6] hash.h: drop unsafe_ function variants Taylor Blau
@ 2024-11-21  9:41   ` Jeff King
  0 siblings, 0 replies; 60+ messages in thread
From: Jeff King @ 2024-11-21  9:41 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Elijah Newren, Junio C Hamano, brian m. carlson

On Wed, Nov 20, 2024 at 02:13:59PM -0500, Taylor Blau wrote:

> Now that all callers have been converted from:
> 
>     the_hash_algo->unsafe_init_fn();
> 
> to
> 
>     unsafe_hash_algo(the_hash_algo)->unsafe_init_fn();
> 
> and similar, we can remove the scaffolding for the unsafe_ function
> variants and force callers to use the new unsafe_hash_algo() mechanic
> instead.

Nice. Especially for sha256, which does not even need to care about this
unsafe thing at all (so in 2099, when we finally remove sha1 support,
this whole system can go away!).

I think this also opens up alternatives for how we conditionally compile
things. E.g., if you have no *_SHA1_UNSAFE macro defined, we could avoid
defining sha1_unsafe_algo at all, and just leave it as NULL. I can't
think of a significant enough advantage to merit the work in converting
to that, though, so it's probably not worth doing unless we later decide
it would make things simpler for some reason.

-Peff

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                   ` (5 preceding siblings ...)
  2024-11-20 19:13 ` [PATCH 6/6] hash.h: drop unsafe_ function variants Taylor Blau
@ 2025-01-08 19:14 ` Taylor Blau
  2025-01-08 19:14   ` [PATCH v2 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
                     ` (9 more replies)
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
  8 siblings, 10 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-08 19:14 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

(This series is rebased on 'master', which is 14650065b7
(RelNotes/2.48.0: fix typos etc., 2025-01-07) at the time of writing).

The bulk of this series is unchanged since last time, but a new seventh
patch that further hardens the hashfile_checkpoint callers on top of
Patrick's recent series[1].

The original cover letter is as follows:

------------

This series implements an idea discussed in [2] which suggests that we
introduce a way to access a wrapped version of a 'struct git_hash_algo'
which represents the unsafe variant of that algorithm, rather than
having individual unsafe_ functions (like unsafe_init_fn() versus
init_fn(), etc.).

This approach is relatively straightforward to implement, and removes a
significant deficiency in the original implementation of
unsafe/non-cryptographic hash functions by making it impossible to
switch between safe- and unsafe variants of hash functions. It also
cleans up the sha1-unsafe test helper's implementation by removing a
large number of "if (unsafe)"-style conditionals.

The series is laid out as follows:

  * The first two patches prepare the hashfile API for the upcoming
    change:

      csum-file: store the hash algorithm as a struct field
      csum-file.c: extract algop from hashfile_checksum_valid()

  * The next patch implements the new 'unsafe_hash_algo()' function at
    the heart of this series' approach:

      hash.h: introduce `unsafe_hash_algo()`

  * The next two patches convert existing callers to use the new
    'unsafe_hash_algo()' function, instead of switching between safe and
    unsafe_ variants of individual functions:

      csum-file.c: use unsafe_hash_algo()
      t/helper/test-hash.c: use unsafe_hash_algo()

  * The final patch drops the unsafe_ function variants following all
    callers being converted to use the new pattern:

      hash.h: drop unsafe_ function variants

Thanks in advance for your review!

[1]: https://lore.kernel.org/git/20241230-pks-meson-sha1-unsafe-v1-0-efb276e171f5@pks.im/
[2]: https://lore.kernel.org/git/20241107013915.GA961214@coredump.intra.peff.net/

Taylor Blau (8):
  t/helper/test-tool: implement sha1-unsafe helper
  csum-file: store the hash algorithm as a struct field
  csum-file.c: extract algop from hashfile_checksum_valid()
  hash.h: introduce `unsafe_hash_algo()`
  csum-file.c: use unsafe_hash_algo()
  t/helper/test-hash.c: use unsafe_hash_algo()
  csum-file: introduce hashfile_checkpoint_init()
  hash.h: drop unsafe_ function variants

 builtin/fast-import.c  |  2 +-
 bulk-checkin.c         |  9 ++++++---
 csum-file.c            | 40 +++++++++++++++++++++++++---------------
 csum-file.h            |  2 ++
 hash.h                 | 20 +++++---------------
 object-file.c          | 41 ++++++++++++++++++++++++++---------------
 t/helper/test-hash.c   |  4 +++-
 t/helper/test-sha1.c   |  7 ++++++-
 t/helper/test-sha1.sh  | 38 ++++++++++++++++++++++----------------
 t/helper/test-sha256.c |  2 +-
 t/helper/test-tool.c   |  1 +
 t/helper/test-tool.h   |  3 ++-
 12 files changed, 100 insertions(+), 69 deletions(-)

Range-diff against v1:
2:  d8c1fc78b57 ! 1:  4c1523a04f1 t/helper/test-tool: implement sha1-unsafe helper
    @@ Metadata
      ## Commit message ##
         t/helper/test-tool: implement sha1-unsafe helper
     
    -    Add a new helper similar to 't/helper/test-tool sha1' called instead
    -    "sha1-unsafe" which uses the unsafe variant of Git's SHA-1 wrappers.
    +    With the new "unsafe" SHA-1 build knob, it is convenient to have a
    +    test-tool that can exercise Git's unsafe SHA-1 wrappers for testing,
    +    similar to 't/helper/test-tool sha1'.
     
    -    While we're at it, modify the test-sha1.sh script to exercise both
    -    the sha1 and sha1-unsafe test tools to ensure that both produce the
    -    expected hash values.
    +    Implement that helper by altering the implementation of that test-tool
    +    (in cmd_hash_impl(), which is generic and parameterized over different
    +    hash functions) to conditionally run the unsafe variants of the chosen
    +    hash function, and expose the new behavior via a new 'sha1-unsafe' test
    +    helper.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    + ## t/helper/test-hash.c ##
    +@@
    + #include "test-tool.h"
    + #include "hex.h"
    + 
    +-int cmd_hash_impl(int ac, const char **av, int algo)
    ++int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
    + {
    + 	git_hash_ctx ctx;
    + 	unsigned char hash[GIT_MAX_HEXSZ];
    +@@ t/helper/test-hash.c: int cmd_hash_impl(int ac, const char **av, int algo)
    + 			die("OOPS");
    + 	}
    + 
    +-	algop->init_fn(&ctx);
    ++	if (unsafe)
    ++		algop->unsafe_init_fn(&ctx);
    ++	else
    ++		algop->init_fn(&ctx);
    + 
    + 	while (1) {
    + 		ssize_t sz, this_sz;
    +@@ t/helper/test-hash.c: int cmd_hash_impl(int ac, const char **av, int algo)
    + 		}
    + 		if (this_sz == 0)
    + 			break;
    +-		algop->update_fn(&ctx, buffer, this_sz);
    ++		if (unsafe)
    ++			algop->unsafe_update_fn(&ctx, buffer, this_sz);
    ++		else
    ++			algop->update_fn(&ctx, buffer, this_sz);
    + 	}
    +-	algop->final_fn(hash, &ctx);
    ++	if (unsafe)
    ++		algop->unsafe_final_fn(hash, &ctx);
    ++	else
    ++		algop->final_fn(hash, &ctx);
    + 
    + 	if (binary)
    + 		fwrite(hash, 1, algop->rawsz, stdout);
    +
      ## t/helper/test-sha1.c ##
    +@@
    + 
    + int cmd__sha1(int ac, const char **av)
    + {
    +-	return cmd_hash_impl(ac, av, GIT_HASH_SHA1);
    ++	return cmd_hash_impl(ac, av, GIT_HASH_SHA1, 0);
    + }
    + 
    + int cmd__sha1_is_sha1dc(int argc UNUSED, const char **argv UNUSED)
     @@ t/helper/test-sha1.c: int cmd__sha1_is_sha1dc(int argc UNUSED, const char **argv UNUSED)
      #endif
      	return 1;
    @@ t/helper/test-sha1.sh
      da39a3ee5e6b4b0d3255bfef95601890afd80709 0
      3f786850e387550fdab836ed7e6dc881de23001b 0 a
     
    + ## t/helper/test-sha256.c ##
    +@@
    + 
    + int cmd__sha256(int ac, const char **av)
    + {
    +-	return cmd_hash_impl(ac, av, GIT_HASH_SHA256);
    ++	return cmd_hash_impl(ac, av, GIT_HASH_SHA256, 0);
    + }
    +
      ## t/helper/test-tool.c ##
     @@ t/helper/test-tool.c: static struct test_cmd cmds[] = {
      	{ "serve-v2", cmd__serve_v2 },
    @@ t/helper/test-tool.h: int cmd__scrap_cache_tree(int argc, const char **argv);
      int cmd__sha256(int argc, const char **argv);
      int cmd__sigchain(int argc, const char **argv);
      int cmd__simple_ipc(int argc, const char **argv);
    +@@ t/helper/test-tool.h: int cmd__windows_named_pipe(int argc, const char **argv);
    + #endif
    + int cmd__write_cache(int argc, const char **argv);
    + 
    +-int cmd_hash_impl(int ac, const char **av, int algo);
    ++int cmd_hash_impl(int ac, const char **av, int algo, int unsafe);
    + 
    + #endif
3:  380133a1142 = 2:  99cc44895b5 csum-file: store the hash algorithm as a struct field
4:  e5076f003bf = 3:  1ffab2f8289 csum-file.c: extract algop from hashfile_checksum_valid()
5:  17f92dba34b = 4:  99dcbe2e716 hash.h: introduce `unsafe_hash_algo()`
6:  0d8e12599e2 = 5:  2dcc2aa6803 csum-file.c: use unsafe_hash_algo()
7:  a49a41703e2 = 6:  a2b9ef03080 t/helper/test-hash.c: use unsafe_hash_algo()
1:  0e2fcee6894 ! 7:  94c07fd8a55 t/helper/test-sha1: prepare for an unsafe mode
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    t/helper/test-sha1: prepare for an unsafe mode
    +    csum-file: introduce hashfile_checkpoint_init()
     
    -    With the new "unsafe" SHA-1 build knob, it would be convenient to have
    -    a test-tool that can exercise Git's unsafe SHA-1 wrappers for testing,
    -    similar to 't/helper/test-tool sha1'.
    +    In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
    +    backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
    +    unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
    +    initialize a hashfile_checkpoint with the same hash function
    +    implementation as is used by the hashfile it is used to checkpoint.
     
    -    Prepare for such a helper by altering the implementation of that
    -    test-tool (in cmd_hash_impl(), which is generic and parameterized over
    -    different hash functions) to conditionally run the unsafe variants of
    -    the chosen hash function.
    +    While both 106140a99f and 9218c0bfe1 work around the immediate crash,
    +    changing the hash function implementation within the hashfile API to,
    +    for example, the non-unsafe variant would re-introduce the crash. This
    +    is a result of the tight coupling between initializing hashfiles and
    +    hashfile_checkpoints.
     
    -    The following commit will add a new test-tool which makes use of this
    -    new parameter.
    +    Introduce and use a new function which ensures that both parts of a
    +    hashfile and hashfile_checkpoint pair use the same hash function
    +    implementation to avoid such crashes.
     
    +    A few things worth noting:
    +
    +      - In the change to builtin/fast-import.c::stream_blob(), we can see
    +        that by removing the explicit reference to
    +        'the_hash_algo->unsafe_init_fn()', we are hardened against the
    +        hashfile API changing away from the_hash_algo (or its unsafe
    +        variant) in the future.
    +
    +      - The bulk-checkin code no longer needs to explicitly zero-initialize
    +        the hashfile_checkpoint, since it is now done as a result of calling
    +        'hashfile_checkpoint_init()'.
    +
    +      - Also in the bulk-checkin code, we add an additional call to
    +        prepare_to_stream() outside of the main loop in order to initialize
    +        'state->f' so we know which hash function implementation to use when
    +        calling 'hashfile_checkpoint_init()'.
    +
    +        This is OK, since subsequent 'prepare_to_stream()' calls are noops.
    +        However, we only need to call 'prepare_to_stream()' when we have the
    +        HASH_WRITE_OBJECT bit set in our flags. Without that bit, calling
    +        'prepare_to_stream()' does not assign 'state->f', so we have nothing
    +        to initialize.
    +
    +      - Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
    +        appropriately guarded.
    +
    +    Helped-by: Patrick Steinhardt <ps@pks.im>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    - ## t/helper/test-hash.c ##
    -@@
    - #include "test-tool.h"
    - #include "hex.h"
    + ## builtin/fast-import.c ##
    +@@ builtin/fast-import.c: static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
    + 		|| (pack_size + PACK_SIZE_THRESHOLD + len) < pack_size)
    + 		cycle_packfile();
      
    --int cmd_hash_impl(int ac, const char **av, int algo)
    -+int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
    - {
    +-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
    ++	hashfile_checkpoint_init(pack_file, &checkpoint);
    + 	hashfile_checkpoint(pack_file, &checkpoint);
    + 	offset = checkpoint.offset;
    + 
    +
    + ## bulk-checkin.c ##
    +@@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
      	git_hash_ctx ctx;
    - 	unsigned char hash[GIT_MAX_HEXSZ];
    -@@ t/helper/test-hash.c: int cmd_hash_impl(int ac, const char **av, int algo)
    - 			die("OOPS");
    - 	}
    + 	unsigned char obuf[16384];
    + 	unsigned header_len;
    +-	struct hashfile_checkpoint checkpoint = {0};
    ++	struct hashfile_checkpoint checkpoint;
    + 	struct pack_idx_entry *idx = NULL;
      
    --	algop->init_fn(&ctx);
    -+	if (unsafe)
    -+		algop->unsafe_init_fn(&ctx);
    -+	else
    -+		algop->init_fn(&ctx);
    + 	seekback = lseek(fd, 0, SEEK_CUR);
    +@@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    + 					  OBJ_BLOB, size);
    + 	the_hash_algo->init_fn(&ctx);
    + 	the_hash_algo->update_fn(&ctx, obuf, header_len);
    +-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
    + 
    + 	/* Note: idx is non-NULL when we are writing */
    +-	if ((flags & HASH_WRITE_OBJECT) != 0)
    ++	if ((flags & HASH_WRITE_OBJECT) != 0) {
    + 		CALLOC_ARRAY(idx, 1);
    + 
    ++		prepare_to_stream(state, flags);
    ++		hashfile_checkpoint_init(state->f, &checkpoint);
    ++	}
    ++
    + 	already_hashed_to = 0;
      
      	while (1) {
    - 		ssize_t sz, this_sz;
    -@@ t/helper/test-hash.c: int cmd_hash_impl(int ac, const char **av, int algo)
    - 		}
    - 		if (this_sz == 0)
    - 			break;
    --		algop->update_fn(&ctx, buffer, this_sz);
    -+		if (unsafe)
    -+			algop->unsafe_update_fn(&ctx, buffer, this_sz);
    -+		else
    -+			algop->update_fn(&ctx, buffer, this_sz);
    - 	}
    --	algop->final_fn(hash, &ctx);
    -+	if (unsafe)
    -+		algop->unsafe_final_fn(hash, &ctx);
    -+	else
    -+		algop->final_fn(hash, &ctx);
    - 
    - 	if (binary)
    - 		fwrite(hash, 1, algop->rawsz, stdout);
     
    - ## t/helper/test-sha1.c ##
    -@@
    - 
    - int cmd__sha1(int ac, const char **av)
    - {
    --	return cmd_hash_impl(ac, av, GIT_HASH_SHA1);
    -+	return cmd_hash_impl(ac, av, GIT_HASH_SHA1, 0);
    + ## csum-file.c ##
    +@@ csum-file.c: struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
    + 	return hashfd_internal(fd, name, tp, 8 * 1024);
      }
      
    - int cmd__sha1_is_sha1dc(int argc UNUSED, const char **argv UNUSED)
    -
    - ## t/helper/test-sha256.c ##
    -@@
    - 
    - int cmd__sha256(int ac, const char **av)
    ++void hashfile_checkpoint_init(struct hashfile *f,
    ++			      struct hashfile_checkpoint *checkpoint)
    ++{
    ++	memset(checkpoint, 0, sizeof(*checkpoint));
    ++	f->algop->init_fn(&checkpoint->ctx);
    ++}
    ++
    + void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
      {
    --	return cmd_hash_impl(ac, av, GIT_HASH_SHA256);
    -+	return cmd_hash_impl(ac, av, GIT_HASH_SHA256, 0);
    - }
    + 	hashflush(f);
     
    - ## t/helper/test-tool.h ##
    -@@ t/helper/test-tool.h: int cmd__windows_named_pipe(int argc, const char **argv);
    - #endif
    - int cmd__write_cache(int argc, const char **argv);
    + ## csum-file.h ##
    +@@ csum-file.h: struct hashfile_checkpoint {
    + 	git_hash_ctx ctx;
    + };
      
    --int cmd_hash_impl(int ac, const char **av, int algo);
    -+int cmd_hash_impl(int ac, const char **av, int algo, int unsafe);
    ++void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
    + void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
    + int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
      
    - #endif
8:  4081ad08549 = 8:  f5579883816 hash.h: drop unsafe_ function variants

base-commit: 14650065b76b28d3cfa9453356ac5669b19e706e
-- 
2.48.0.rc2.33.gaab3d23ed4c

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v2 1/8] t/helper/test-tool: implement sha1-unsafe helper
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
@ 2025-01-08 19:14   ` Taylor Blau
  2025-01-08 19:14   ` [PATCH v2 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-08 19:14 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

With the new "unsafe" SHA-1 build knob, it is convenient to have a
test-tool that can exercise Git's unsafe SHA-1 wrappers for testing,
similar to 't/helper/test-tool sha1'.

Implement that helper by altering the implementation of that test-tool
(in cmd_hash_impl(), which is generic and parameterized over different
hash functions) to conditionally run the unsafe variants of the chosen
hash function, and expose the new behavior via a new 'sha1-unsafe' test
helper.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hash.c   | 17 +++++++++++++----
 t/helper/test-sha1.c   |  7 ++++++-
 t/helper/test-sha1.sh  | 38 ++++++++++++++++++++++----------------
 t/helper/test-sha256.c |  2 +-
 t/helper/test-tool.c   |  1 +
 t/helper/test-tool.h   |  3 ++-
 6 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/t/helper/test-hash.c b/t/helper/test-hash.c
index 45d829c908f..d0ee668df95 100644
--- a/t/helper/test-hash.c
+++ b/t/helper/test-hash.c
@@ -1,7 +1,7 @@
 #include "test-tool.h"
 #include "hex.h"
 
-int cmd_hash_impl(int ac, const char **av, int algo)
+int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 {
 	git_hash_ctx ctx;
 	unsigned char hash[GIT_MAX_HEXSZ];
@@ -27,7 +27,10 @@ int cmd_hash_impl(int ac, const char **av, int algo)
 			die("OOPS");
 	}
 
-	algop->init_fn(&ctx);
+	if (unsafe)
+		algop->unsafe_init_fn(&ctx);
+	else
+		algop->init_fn(&ctx);
 
 	while (1) {
 		ssize_t sz, this_sz;
@@ -46,9 +49,15 @@ int cmd_hash_impl(int ac, const char **av, int algo)
 		}
 		if (this_sz == 0)
 			break;
-		algop->update_fn(&ctx, buffer, this_sz);
+		if (unsafe)
+			algop->unsafe_update_fn(&ctx, buffer, this_sz);
+		else
+			algop->update_fn(&ctx, buffer, this_sz);
 	}
-	algop->final_fn(hash, &ctx);
+	if (unsafe)
+		algop->unsafe_final_fn(hash, &ctx);
+	else
+		algop->final_fn(hash, &ctx);
 
 	if (binary)
 		fwrite(hash, 1, algop->rawsz, stdout);
diff --git a/t/helper/test-sha1.c b/t/helper/test-sha1.c
index e60d000c039..349540c4df8 100644
--- a/t/helper/test-sha1.c
+++ b/t/helper/test-sha1.c
@@ -3,7 +3,7 @@
 
 int cmd__sha1(int ac, const char **av)
 {
-	return cmd_hash_impl(ac, av, GIT_HASH_SHA1);
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA1, 0);
 }
 
 int cmd__sha1_is_sha1dc(int argc UNUSED, const char **argv UNUSED)
@@ -13,3 +13,8 @@ int cmd__sha1_is_sha1dc(int argc UNUSED, const char **argv UNUSED)
 #endif
 	return 1;
 }
+
+int cmd__sha1_unsafe(int ac, const char **av)
+{
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA1, 1);
+}
diff --git a/t/helper/test-sha1.sh b/t/helper/test-sha1.sh
index 84594885c70..bf387d3db14 100755
--- a/t/helper/test-sha1.sh
+++ b/t/helper/test-sha1.sh
@@ -3,25 +3,31 @@
 dd if=/dev/zero bs=1048576 count=100 2>/dev/null |
 /usr/bin/time t/helper/test-tool sha1 >/dev/null
 
+dd if=/dev/zero bs=1048576 count=100 2>/dev/null |
+/usr/bin/time t/helper/test-tool sha1-unsafe >/dev/null
+
 while read expect cnt pfx
 do
 	case "$expect" in '#'*) continue ;; esac
-	actual=$(
-		{
-			test -z "$pfx" || echo "$pfx"
-			dd if=/dev/zero bs=1048576 count=$cnt 2>/dev/null |
-			perl -pe 'y/\000/g/'
-		} | ./t/helper/test-tool sha1 $cnt
-	)
-	if test "$expect" = "$actual"
-	then
-		echo "OK: $expect $cnt $pfx"
-	else
-		echo >&2 "OOPS: $cnt"
-		echo >&2 "expect: $expect"
-		echo >&2 "actual: $actual"
-		exit 1
-	fi
+	for sha1 in sha1 sha1-unsafe
+	do
+		actual=$(
+			{
+				test -z "$pfx" || echo "$pfx"
+				dd if=/dev/zero bs=1048576 count=$cnt 2>/dev/null |
+				perl -pe 'y/\000/g/'
+			} | ./t/helper/test-tool $sha1 $cnt
+		)
+		if test "$expect" = "$actual"
+		then
+			echo "OK ($sha1): $expect $cnt $pfx"
+		else
+			echo >&2 "OOPS ($sha1): $cnt"
+			echo >&2 "expect ($sha1): $expect"
+			echo >&2 "actual ($sha1): $actual"
+			exit 1
+		fi
+	done
 done <<EOF
 da39a3ee5e6b4b0d3255bfef95601890afd80709 0
 3f786850e387550fdab836ed7e6dc881de23001b 0 a
diff --git a/t/helper/test-sha256.c b/t/helper/test-sha256.c
index 2fb20438f3c..7fd0aa1fcd3 100644
--- a/t/helper/test-sha256.c
+++ b/t/helper/test-sha256.c
@@ -3,5 +3,5 @@
 
 int cmd__sha256(int ac, const char **av)
 {
-	return cmd_hash_impl(ac, av, GIT_HASH_SHA256);
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA256, 0);
 }
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 4a7aa993ba9..958452ef12e 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -70,6 +70,7 @@ static struct test_cmd cmds[] = {
 	{ "serve-v2", cmd__serve_v2 },
 	{ "sha1", cmd__sha1 },
 	{ "sha1-is-sha1dc", cmd__sha1_is_sha1dc },
+	{ "sha1-unsafe", cmd__sha1_unsafe },
 	{ "sha256", cmd__sha256 },
 	{ "sigchain", cmd__sigchain },
 	{ "simple-ipc", cmd__simple_ipc },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..24149edd414 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -63,6 +63,7 @@ int cmd__scrap_cache_tree(int argc, const char **argv);
 int cmd__serve_v2(int argc, const char **argv);
 int cmd__sha1(int argc, const char **argv);
 int cmd__sha1_is_sha1dc(int argc, const char **argv);
+int cmd__sha1_unsafe(int argc, const char **argv);
 int cmd__sha256(int argc, const char **argv);
 int cmd__sigchain(int argc, const char **argv);
 int cmd__simple_ipc(int argc, const char **argv);
@@ -81,6 +82,6 @@ int cmd__windows_named_pipe(int argc, const char **argv);
 #endif
 int cmd__write_cache(int argc, const char **argv);
 
-int cmd_hash_impl(int ac, const char **av, int algo);
+int cmd_hash_impl(int ac, const char **av, int algo, int unsafe);
 
 #endif
-- 
2.48.0.rc2.33.gaab3d23ed4c


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 2/8] csum-file: store the hash algorithm as a struct field
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
  2025-01-08 19:14   ` [PATCH v2 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
@ 2025-01-08 19:14   ` Taylor Blau
  2025-01-16 11:48     ` Patrick Steinhardt
  2025-01-08 19:14   ` [PATCH v2 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 60+ messages in thread
From: Taylor Blau @ 2025-01-08 19:14 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
and call its _usnafe function variants directly.

Prepare for a future change where we may use a different 'git_hash_algo'
pointer (instead of just relying on 'the_hash_algo' throughout) by
making the 'git_hash_algo' pointer a member of the 'hashfile' structure
itself.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 20 +++++++++++---------
 csum-file.h |  1 +
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 5716016e12e..b28cd047e3f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -50,7 +50,7 @@ void hashflush(struct hashfile *f)
 
 	if (offset) {
 		if (!f->skip_hash)
-			the_hash_algo->unsafe_update_fn(&f->ctx, f->buffer, offset);
+			f->algop->unsafe_update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -71,14 +71,14 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	hashflush(f);
 
 	if (f->skip_hash)
-		hashclr(f->buffer, the_repository->hash_algo);
+		hashclr(f->buffer, f->algop);
 	else
-		the_hash_algo->unsafe_final_fn(f->buffer, &f->ctx);
+		f->algop->unsafe_final_fn(f->buffer, &f->ctx);
 
 	if (result)
-		hashcpy(result, f->buffer, the_repository->hash_algo);
+		hashcpy(result, f->buffer, f->algop);
 	if (flags & CSUM_HASH_IN_STREAM)
-		flush(f, f->buffer, the_hash_algo->rawsz);
+		flush(f, f->buffer, f->algop->rawsz);
 	if (flags & CSUM_FSYNC)
 		fsync_component_or_die(component, f->fd, f->name);
 	if (flags & CSUM_CLOSE) {
@@ -128,7 +128,7 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * f->offset is necessarily zero.
 			 */
 			if (!f->skip_hash)
-				the_hash_algo->unsafe_update_fn(&f->ctx, buf, nr);
+				f->algop->unsafe_update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -174,7 +174,9 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->name = name;
 	f->do_crc = 0;
 	f->skip_hash = 0;
-	the_hash_algo->unsafe_init_fn(&f->ctx);
+
+	f->algop = the_hash_algo;
+	f->algop->unsafe_init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
 	f->buffer = xmalloc(buffer_len);
@@ -208,7 +210,7 @@ void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpo
 {
 	hashflush(f);
 	checkpoint->offset = f->total;
-	the_hash_algo->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
+	f->algop->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
 }
 
 int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
@@ -219,7 +221,7 @@ int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint
 	    lseek(f->fd, offset, SEEK_SET) != offset)
 		return -1;
 	f->total = offset;
-	the_hash_algo->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
+	f->algop->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
 	f->offset = 0; /* hashflush() was called in checkpoint */
 	return 0;
 }
diff --git a/csum-file.h b/csum-file.h
index 7c73da0a40a..2b45f4673a2 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,7 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+	const struct git_hash_algo *algop;
 
 	/**
 	 * If non-zero, skip_hash indicates that we should
-- 
2.48.0.rc2.33.gaab3d23ed4c


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 2/8] csum-file: store the hash algorithm as a struct field
  2025-01-08 19:14   ` [PATCH v2 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
@ 2025-01-16 11:48     ` Patrick Steinhardt
  2025-01-17 21:17       ` Taylor Blau
  0 siblings, 1 reply; 60+ messages in thread
From: Patrick Steinhardt @ 2025-01-16 11:48 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano, Elijah Newren

On Wed, Jan 08, 2025 at 02:14:35PM -0500, Taylor Blau wrote:
> Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
> and call its _usnafe function variants directly.

s/usnafe/unsafe/

> Prepare for a future change where we may use a different 'git_hash_algo'
> pointer (instead of just relying on 'the_hash_algo' throughout) by
> making the 'git_hash_algo' pointer a member of the 'hashfile' structure
> itself.

Makes sense, and it's also a good step for libification. I wonder: does
it mean that we can also get rid of `USE_THE_REPOSITORY_VARIABLE`, or do
we still depend on it in this file? The answer is yes, as we only reduce
the sites where we use `the_hash_algo`, but don't remove it altogether.
That would require the caller to provide the hash algo to us.

Patrick

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 2/8] csum-file: store the hash algorithm as a struct field
  2025-01-16 11:48     ` Patrick Steinhardt
@ 2025-01-17 21:17       ` Taylor Blau
  0 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 21:17 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano, Elijah Newren

On Thu, Jan 16, 2025 at 12:48:58PM +0100, Patrick Steinhardt wrote:
> On Wed, Jan 08, 2025 at 02:14:35PM -0500, Taylor Blau wrote:
> > Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
> > and call its _usnafe function variants directly.
>
> s/usnafe/unsafe/

Oops. Thanks for spotting!

> > Prepare for a future change where we may use a different 'git_hash_algo'
> > pointer (instead of just relying on 'the_hash_algo' throughout) by
> > making the 'git_hash_algo' pointer a member of the 'hashfile' structure
> > itself.
>
> Makes sense, and it's also a good step for libification. I wonder: does
> it mean that we can also get rid of `USE_THE_REPOSITORY_VARIABLE`, or do
> we still depend on it in this file? The answer is yes, as we only reduce
> the sites where we use `the_hash_algo`, but don't remove it altogether.
> That would require the caller to provide the hash algo to us.

Yeah, I agree that we're not quite there yet, but likewise that this
patch is a step in that direction.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v2 3/8] csum-file.c: extract algop from hashfile_checksum_valid()
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
  2025-01-08 19:14   ` [PATCH v2 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
  2025-01-08 19:14   ` [PATCH v2 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
@ 2025-01-08 19:14   ` Taylor Blau
  2025-01-08 19:14   ` [PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-08 19:14 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Perform a similar transformation as in the previous commit, but focused
instead on hashfile_checksum_valid(). This function does not work with a
hashfile structure itself, and instead validates the raw contents of a
file written using the hashfile API.

We'll want to be prepared for a similar change to this function in the
future, so prepare ourselves for that by extracting 'the_hash_algo' into
its own field for use within this function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index b28cd047e3f..7a71121e340 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -242,14 +242,15 @@ int hashfile_checksum_valid(const unsigned char *data, size_t total_len)
 {
 	unsigned char got[GIT_MAX_RAWSZ];
 	git_hash_ctx ctx;
-	size_t data_len = total_len - the_hash_algo->rawsz;
+	const struct git_hash_algo *algop = the_hash_algo;
+	size_t data_len = total_len - algop->rawsz;
 
-	if (total_len < the_hash_algo->rawsz)
+	if (total_len < algop->rawsz)
 		return 0; /* say "too short"? */
 
-	the_hash_algo->unsafe_init_fn(&ctx);
-	the_hash_algo->unsafe_update_fn(&ctx, data, data_len);
-	the_hash_algo->unsafe_final_fn(got, &ctx);
+	algop->unsafe_init_fn(&ctx);
+	algop->unsafe_update_fn(&ctx, data, data_len);
+	algop->unsafe_final_fn(got, &ctx);
 
-	return hasheq(got, data + data_len, the_repository->hash_algo);
+	return hasheq(got, data + data_len, algop);
 }
-- 
2.48.0.rc2.33.gaab3d23ed4c


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()`
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                     ` (2 preceding siblings ...)
  2025-01-08 19:14   ` [PATCH v2 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
@ 2025-01-08 19:14   ` Taylor Blau
  2025-01-16 11:49     ` Patrick Steinhardt
  2025-01-08 19:14   ` [PATCH v2 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 60+ messages in thread
From: Taylor Blau @ 2025-01-08 19:14 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

In 253ed9ecff (hash.h: scaffolding for _unsafe hashing variants,
2024-09-26), we introduced "unsafe" variants of the SHA-1 hashing
functions by introducing new functions like "unsafe_init_fn()" and so
on.

This approach has a major shortcoming that callers must remember to
consistently use one variant or the other. Failing to consistently use
(or not use) the unsafe variants can lead to crashes at best, or subtle
memory corruption issues at worst.

In the hashfile API, this isn't difficult to achieve, but verifying that
all callers consistently use the unsafe variants is somewhat of a chore
given how spread out all of the callers are. In the sha1 and sha1-unsafe
test helpers, all of the calls to various hash functions are guarded by
an "if (unsafe)" conditional, which is repetitive and cumbersome.

Address these issues by introducing a new pattern whereby one
'git_hash_algo' can return a pointer to another 'git_hash_algo' that
represents the unsafe version of itself. So instead of having something
like:

    if (unsafe)
      the_hash_algo->init_fn(...);
      the_hash_algo->update_fn(...);
      the_hash_algo->final_fn(...);
    else
      the_hash_algo->unsafe_init_fn(...);
      the_hash_algo->unsafe_update_fn(...);
      the_hash_algo->unsafe_final_fn(...);

we can instead write:

    struct git_hash_algo *algop = the_hash_algo;
    if (unsafe)
      algop = unsafe_hash_algo(algop);

    the_hash_algo->init_fn(...);
    the_hash_algo->update_fn(...);
    the_hash_algo->final_fn(...);

This removes the existing shortcoming by no longer forcing the caller to
"remember" which variant of the hash functions it wants to call, only to
hold onto a 'struct git_hash_algo' pointer that is initialized once.

Similarly, while there currently is still a way to "mix" safe and unsafe
functions, this too will go away after subsequent commits remove all
direct calls to the unsafe_ variants.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 hash.h        |  5 +++++
 object-file.c | 26 ++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/hash.h b/hash.h
index 756166ce5e8..23cf6876e50 100644
--- a/hash.h
+++ b/hash.h
@@ -305,6 +305,9 @@ struct git_hash_algo {
 
 	/* The all-zeros OID. */
 	const struct object_id *null_oid;
+
+	/* The unsafe variant of this hash function, if one exists. */
+	const struct git_hash_algo *unsafe;
 };
 extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
 
@@ -323,6 +326,8 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
 	return p - hash_algos;
 }
 
+const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop);
+
 const struct object_id *null_oid(void);
 
 static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2, const struct git_hash_algo *algop)
diff --git a/object-file.c b/object-file.c
index 5b792b3dd42..43efa4ca361 100644
--- a/object-file.c
+++ b/object-file.c
@@ -202,6 +202,22 @@ static void git_hash_unknown_final_oid(struct object_id *oid UNUSED,
 	BUG("trying to finalize unknown hash");
 }
 
+static const struct git_hash_algo sha1_unsafe_algo = {
+	.name = "sha1",
+	.format_id = GIT_SHA1_FORMAT_ID,
+	.rawsz = GIT_SHA1_RAWSZ,
+	.hexsz = GIT_SHA1_HEXSZ,
+	.blksz = GIT_SHA1_BLKSZ,
+	.init_fn = git_hash_sha1_init_unsafe,
+	.clone_fn = git_hash_sha1_clone_unsafe,
+	.update_fn = git_hash_sha1_update_unsafe,
+	.final_fn = git_hash_sha1_final_unsafe,
+	.final_oid_fn = git_hash_sha1_final_oid_unsafe,
+	.empty_tree = &empty_tree_oid,
+	.empty_blob = &empty_blob_oid,
+	.null_oid = &null_oid_sha1,
+};
+
 const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 	{
 		.name = NULL,
@@ -239,6 +255,7 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.unsafe_update_fn = git_hash_sha1_update_unsafe,
 		.unsafe_final_fn = git_hash_sha1_final_unsafe,
 		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
+		.unsafe = &sha1_unsafe_algo,
 		.empty_tree = &empty_tree_oid,
 		.empty_blob = &empty_blob_oid,
 		.null_oid = &null_oid_sha1,
@@ -305,6 +322,15 @@ int hash_algo_by_length(int len)
 	return GIT_HASH_UNKNOWN;
 }
 
+const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop)
+{
+	/* If we have a faster "unsafe" implementation, use that. */
+	if (algop->unsafe)
+		return algop->unsafe;
+	/* Otherwise use the default one. */
+	return algop;
+}
+
 /*
  * This is meant to hold a *small* number of objects that you would
  * want repo_read_object_file() to be able to return, but yet you do not want
-- 
2.48.0.rc2.33.gaab3d23ed4c


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()`
  2025-01-08 19:14   ` [PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
@ 2025-01-16 11:49     ` Patrick Steinhardt
  2025-01-17 21:18       ` Taylor Blau
  0 siblings, 1 reply; 60+ messages in thread
From: Patrick Steinhardt @ 2025-01-16 11:49 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano, Elijah Newren

On Wed, Jan 08, 2025 at 02:14:42PM -0500, Taylor Blau wrote:
> Address these issues by introducing a new pattern whereby one
> 'git_hash_algo' can return a pointer to another 'git_hash_algo' that
> represents the unsafe version of itself. So instead of having something
> like:
> 
>     if (unsafe)
>       the_hash_algo->init_fn(...);
>       the_hash_algo->update_fn(...);
>       the_hash_algo->final_fn(...);
>     else
>       the_hash_algo->unsafe_init_fn(...);
>       the_hash_algo->unsafe_update_fn(...);
>       the_hash_algo->unsafe_final_fn(...);
> 
> we can instead write:
> 
>     struct git_hash_algo *algop = the_hash_algo;
>     if (unsafe)
>       algop = unsafe_hash_algo(algop);
> 
>     the_hash_algo->init_fn(...);
>     the_hash_algo->update_fn(...);
>     the_hash_algo->final_fn(...);

This should all be `algop->init_fn(...)` and so on, right?

> This removes the existing shortcoming by no longer forcing the caller to
> "remember" which variant of the hash functions it wants to call, only to
> hold onto a 'struct git_hash_algo' pointer that is initialized once.

We still have to remember something, but now we only have to remember
the pointer to the hash algorithm itself, which definitely is less
tedious.

For what it's worth, this prompted me to have another look at my
proposal to stop having to use the hash algo altogether for `update()`
et al, and now have a version that works. It builds on top of your patch
series, which still is a step into the right direction. So I'll send it
once your series is being merged down.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()`
  2025-01-16 11:49     ` Patrick Steinhardt
@ 2025-01-17 21:18       ` Taylor Blau
  0 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 21:18 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano, Elijah Newren

On Thu, Jan 16, 2025 at 12:49:56PM +0100, Patrick Steinhardt wrote:
> > we can instead write:
> >
> >     struct git_hash_algo *algop = the_hash_algo;
> >     if (unsafe)
> >       algop = unsafe_hash_algo(algop);
> >
> >     the_hash_algo->init_fn(...);
> >     the_hash_algo->update_fn(...);
> >     the_hash_algo->final_fn(...);
>
> This should all be `algop->init_fn(...)` and so on, right?

Most definitely :-).

> For what it's worth, this prompted me to have another look at my
> proposal to stop having to use the hash algo altogether for `update()`
> et al, and now have a version that works. It builds on top of your patch
> series, which still is a step into the right direction. So I'll send it
> once your series is being merged down.

Exciting! I'm looking forward to reviewing it.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v2 5/8] csum-file.c: use unsafe_hash_algo()
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                     ` (3 preceding siblings ...)
  2025-01-08 19:14   ` [PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
@ 2025-01-08 19:14   ` Taylor Blau
  2025-01-08 19:14   ` [PATCH v2 6/8] t/helper/test-hash.c: " Taylor Blau
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-08 19:14 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Instead of calling the unsafe_ hash function variants directly, make use
of the shared 'algop' pointer by initializing it to:

    f->algop = unsafe_hash_algo(the_hash_algo);

, thus making all calls use the unsafe variants directly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 7a71121e340..ebffc80ef71 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -50,7 +50,7 @@ void hashflush(struct hashfile *f)
 
 	if (offset) {
 		if (!f->skip_hash)
-			f->algop->unsafe_update_fn(&f->ctx, f->buffer, offset);
+			f->algop->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -73,7 +73,7 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	if (f->skip_hash)
 		hashclr(f->buffer, f->algop);
 	else
-		f->algop->unsafe_final_fn(f->buffer, &f->ctx);
+		f->algop->final_fn(f->buffer, &f->ctx);
 
 	if (result)
 		hashcpy(result, f->buffer, f->algop);
@@ -128,7 +128,7 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * f->offset is necessarily zero.
 			 */
 			if (!f->skip_hash)
-				f->algop->unsafe_update_fn(&f->ctx, buf, nr);
+				f->algop->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -175,8 +175,8 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->do_crc = 0;
 	f->skip_hash = 0;
 
-	f->algop = the_hash_algo;
-	f->algop->unsafe_init_fn(&f->ctx);
+	f->algop = unsafe_hash_algo(the_hash_algo);
+	f->algop->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
 	f->buffer = xmalloc(buffer_len);
@@ -210,7 +210,7 @@ void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpo
 {
 	hashflush(f);
 	checkpoint->offset = f->total;
-	f->algop->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
+	f->algop->clone_fn(&checkpoint->ctx, &f->ctx);
 }
 
 int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
@@ -221,7 +221,7 @@ int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint
 	    lseek(f->fd, offset, SEEK_SET) != offset)
 		return -1;
 	f->total = offset;
-	f->algop->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
+	f->algop->clone_fn(&f->ctx, &checkpoint->ctx);
 	f->offset = 0; /* hashflush() was called in checkpoint */
 	return 0;
 }
@@ -242,15 +242,15 @@ int hashfile_checksum_valid(const unsigned char *data, size_t total_len)
 {
 	unsigned char got[GIT_MAX_RAWSZ];
 	git_hash_ctx ctx;
-	const struct git_hash_algo *algop = the_hash_algo;
+	const struct git_hash_algo *algop = unsafe_hash_algo(the_hash_algo);
 	size_t data_len = total_len - algop->rawsz;
 
 	if (total_len < algop->rawsz)
 		return 0; /* say "too short"? */
 
-	algop->unsafe_init_fn(&ctx);
-	algop->unsafe_update_fn(&ctx, data, data_len);
-	algop->unsafe_final_fn(got, &ctx);
+	algop->init_fn(&ctx);
+	algop->update_fn(&ctx, data, data_len);
+	algop->final_fn(got, &ctx);
 
 	return hasheq(got, data + data_len, algop);
 }
-- 
2.48.0.rc2.33.gaab3d23ed4c


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 6/8] t/helper/test-hash.c: use unsafe_hash_algo()
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                     ` (4 preceding siblings ...)
  2025-01-08 19:14   ` [PATCH v2 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
@ 2025-01-08 19:14   ` Taylor Blau
  2025-01-08 19:14   ` [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-08 19:14 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Remove a series of conditionals within the shared cmd_hash_impl() helper
that powers the 'sha1' and 'sha1-unsafe' helpers.

Instead, replace them with a single conditional that transforms the
specified hash algorithm into its unsafe variant. Then all subsequent
calls can directly use whatever function it wants to call without having
to decide between the safe and unsafe variants.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hash.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/t/helper/test-hash.c b/t/helper/test-hash.c
index d0ee668df95..aa82638c621 100644
--- a/t/helper/test-hash.c
+++ b/t/helper/test-hash.c
@@ -9,6 +9,8 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 	int binary = 0;
 	char *buffer;
 	const struct git_hash_algo *algop = &hash_algos[algo];
+	if (unsafe)
+		algop = unsafe_hash_algo(algop);
 
 	if (ac == 2) {
 		if (!strcmp(av[1], "-b"))
@@ -27,10 +29,7 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 			die("OOPS");
 	}
 
-	if (unsafe)
-		algop->unsafe_init_fn(&ctx);
-	else
-		algop->init_fn(&ctx);
+	algop->init_fn(&ctx);
 
 	while (1) {
 		ssize_t sz, this_sz;
@@ -49,15 +48,9 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 		}
 		if (this_sz == 0)
 			break;
-		if (unsafe)
-			algop->unsafe_update_fn(&ctx, buffer, this_sz);
-		else
-			algop->update_fn(&ctx, buffer, this_sz);
+		algop->update_fn(&ctx, buffer, this_sz);
 	}
-	if (unsafe)
-		algop->unsafe_final_fn(hash, &ctx);
-	else
-		algop->final_fn(hash, &ctx);
+	algop->final_fn(hash, &ctx);
 
 	if (binary)
 		fwrite(hash, 1, algop->rawsz, stdout);
-- 
2.48.0.rc2.33.gaab3d23ed4c


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init()
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                     ` (5 preceding siblings ...)
  2025-01-08 19:14   ` [PATCH v2 6/8] t/helper/test-hash.c: " Taylor Blau
@ 2025-01-08 19:14   ` Taylor Blau
  2025-01-10 10:37     ` Jeff King
  2025-01-08 19:14   ` [PATCH v2 8/8] hash.h: drop unsafe_ function variants Taylor Blau
                     ` (2 subsequent siblings)
  9 siblings, 1 reply; 60+ messages in thread
From: Taylor Blau @ 2025-01-08 19:14 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
initialize a hashfile_checkpoint with the same hash function
implementation as is used by the hashfile it is used to checkpoint.

While both 106140a99f and 9218c0bfe1 work around the immediate crash,
changing the hash function implementation within the hashfile API to,
for example, the non-unsafe variant would re-introduce the crash. This
is a result of the tight coupling between initializing hashfiles and
hashfile_checkpoints.

Introduce and use a new function which ensures that both parts of a
hashfile and hashfile_checkpoint pair use the same hash function
implementation to avoid such crashes.

A few things worth noting:

  - In the change to builtin/fast-import.c::stream_blob(), we can see
    that by removing the explicit reference to
    'the_hash_algo->unsafe_init_fn()', we are hardened against the
    hashfile API changing away from the_hash_algo (or its unsafe
    variant) in the future.

  - The bulk-checkin code no longer needs to explicitly zero-initialize
    the hashfile_checkpoint, since it is now done as a result of calling
    'hashfile_checkpoint_init()'.

  - Also in the bulk-checkin code, we add an additional call to
    prepare_to_stream() outside of the main loop in order to initialize
    'state->f' so we know which hash function implementation to use when
    calling 'hashfile_checkpoint_init()'.

    This is OK, since subsequent 'prepare_to_stream()' calls are noops.
    However, we only need to call 'prepare_to_stream()' when we have the
    HASH_WRITE_OBJECT bit set in our flags. Without that bit, calling
    'prepare_to_stream()' does not assign 'state->f', so we have nothing
    to initialize.

  - Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
    appropriately guarded.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/fast-import.c | 2 +-
 bulk-checkin.c        | 9 ++++++---
 csum-file.c           | 7 +++++++
 csum-file.h           | 1 +
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 0f86392761a..4a6c7ab52ac 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -1106,7 +1106,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 		|| (pack_size + PACK_SIZE_THRESHOLD + len) < pack_size)
 		cycle_packfile();
 
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
+	hashfile_checkpoint_init(pack_file, &checkpoint);
 	hashfile_checkpoint(pack_file, &checkpoint);
 	offset = checkpoint.offset;
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bda..892176d23d2 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
 
 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);
 
+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;
 
 	while (1) {
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef71..232121f415f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,13 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }
 
+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+	f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a2..b7475f16c20 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -36,6 +36,7 @@ struct hashfile_checkpoint {
 	git_hash_ctx ctx;
 };
 
+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 
-- 
2.48.0.rc2.33.gaab3d23ed4c


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init()
  2025-01-08 19:14   ` [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
@ 2025-01-10 10:37     ` Jeff King
  2025-01-10 21:50       ` Taylor Blau
  0 siblings, 1 reply; 60+ messages in thread
From: Jeff King @ 2025-01-10 10:37 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Elijah Newren, Patrick Steinhardt

On Wed, Jan 08, 2025 at 02:14:51PM -0500, Taylor Blau wrote:

> Introduce and use a new function which ensures that both parts of a
> hashfile and hashfile_checkpoint pair use the same hash function
> implementation to avoid such crashes.

That makes sense. This should have been encapsulated all along, just
like the actual hash initialization happens inside hashfile_init().

A hashfile_checkpoint is sort of inherently tied to a hashfile, right? I
mean, it is recording an offset that only makes sense in the context of
the parent hashfile.

And that is only more true after the unsafe-hash patches, because now it
needs to use the "algop" pointer from the parent hashfile (though for
now we expect all hashfiles to use the same unsafe-algop, in theory we
could use different checksums for each file).

So in the new constructor:

> +void hashfile_checkpoint_init(struct hashfile *f,
> +			      struct hashfile_checkpoint *checkpoint)
> +{
> +	memset(checkpoint, 0, sizeof(*checkpoint));
> +	f->algop->init_fn(&checkpoint->ctx);
> +}

...should we actually record "f" itself? And then in the existing
functions:

>  void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)

...they'd no longer need to take the extra parameter.

It creates a lifetime dependency of the checkpoint struct on the "f" it
is checkpointing, but I think that is naturally modeling the domain.

A semi-related thing I wondered about: do we need a destructor/release
function of some kind? Long ago when this checkpoint code was added, a
memcpy() of the sha_ctx struct was sufficient. But these days we use
clone_fn(), which may call openssl_SHA1_Clone(), which does
EVP_MD_CTX_copy_ex() under the hood. Do we have any promise that this
doesn't allocate any resources that might need a call to _Final() to
release (or I guess the more efficient way is directly EVP_MD_CTX_free()
under the hood).

My reading of the openssl manpages suggests that we should be doing
that, or we may see leaks. But it may also be the case that it doesn't
happen to trigger for their implementation.

At any rate, we do not seem to have such a cleanup function. So it is
certainly an orthogonal issue to your series. I wondered about it here
because if we did have one, it would be necessary to clean up checkpoint
before the hashfile due to the lifetime dependency I mentioned above.

-Peff

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init()
  2025-01-10 10:37     ` Jeff King
@ 2025-01-10 21:50       ` Taylor Blau
  2025-01-17 21:30         ` Taylor Blau
  0 siblings, 1 reply; 60+ messages in thread
From: Taylor Blau @ 2025-01-10 21:50 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Elijah Newren, Patrick Steinhardt

On Fri, Jan 10, 2025 at 05:37:56AM -0500, Jeff King wrote:
> So in the new constructor:
>
> > +void hashfile_checkpoint_init(struct hashfile *f,
> > +			      struct hashfile_checkpoint *checkpoint)
> > +{
> > +	memset(checkpoint, 0, sizeof(*checkpoint));
> > +	f->algop->init_fn(&checkpoint->ctx);
> > +}
>
> ...should we actually record "f" itself? And then in the existing
> functions:
>
> >  void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
>
> ...they'd no longer need to take the extra parameter.
>
> It creates a lifetime dependency of the checkpoint struct on the "f" it
> is checkpointing, but I think that is naturally modeling the domain.

Thanks, I really like these suggestions. I adjusted the series
accordingly to do this cleanup in two patches (one for
hashfile_checkpoint(), another for hashfile_truncate()) after the patch
introducing hashfile_checkpoint_init().

> A semi-related thing I wondered about: do we need a destructor/release
> function of some kind? Long ago when this checkpoint code was added, a
> memcpy() of the sha_ctx struct was sufficient. But these days we use
> clone_fn(), which may call openssl_SHA1_Clone(), which does
> EVP_MD_CTX_copy_ex() under the hood. Do we have any promise that this
> doesn't allocate any resources that might need a call to _Final() to
> release (or I guess the more efficient way is directly EVP_MD_CTX_free()
> under the hood).
>
> My reading of the openssl manpages suggests that we should be doing
> that, or we may see leaks. But it may also be the case that it doesn't
> happen to trigger for their implementation.
>
> At any rate, we do not seem to have such a cleanup function. So it is
> certainly an orthogonal issue to your series. I wondered about it here
> because if we did have one, it would be necessary to clean up checkpoint
> before the hashfile due to the lifetime dependency I mentioned above.

I like the idea of a cleanup function, but let's do so in a separate
series.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init()
  2025-01-10 21:50       ` Taylor Blau
@ 2025-01-17 21:30         ` Taylor Blau
  2025-01-18 12:15           ` Jeff King
  0 siblings, 1 reply; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 21:30 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Elijah Newren, Patrick Steinhardt

On Fri, Jan 10, 2025 at 04:50:25PM -0500, Taylor Blau wrote:
> On Fri, Jan 10, 2025 at 05:37:56AM -0500, Jeff King wrote:
> > So in the new constructor:
> >
> > > +void hashfile_checkpoint_init(struct hashfile *f,
> > > +			      struct hashfile_checkpoint *checkpoint)
> > > +{
> > > +	memset(checkpoint, 0, sizeof(*checkpoint));
> > > +	f->algop->init_fn(&checkpoint->ctx);
> > > +}
> >
> > ...should we actually record "f" itself? And then in the existing
> > functions:
> >
> > >  void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
> >
> > ...they'd no longer need to take the extra parameter.
> >
> > It creates a lifetime dependency of the checkpoint struct on the "f" it
> > is checkpointing, but I think that is naturally modeling the domain.
>
> Thanks, I really like these suggestions. I adjusted the series
> accordingly to do this cleanup in two patches (one for
> hashfile_checkpoint(), another for hashfile_truncate()) after the patch
> introducing hashfile_checkpoint_init().

Hmm. I'm not sure that I like this as much as I thought I did.

I agree with you that ultimately the hashfile_checkpoint is (or should
be) tied to the lifetime of the hashfile that it is checkpointing
underneath. But in practice things are a little funky.

Let's suppose I did something like the following:

--- 8< ---
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef7..47b8317a1f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,15 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }

+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+
+	checkpoint->f = f;
+	checkpoint->f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a..8016509c71 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -34,8 +34,10 @@ struct hashfile {
 struct hashfile_checkpoint {
 	off_t offset;
 	git_hash_ctx ctx;
+	struct hashfile *f;
 };

+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
--- >8 ---

, where I'm eliding the trivial changes necessary to make this work at
the two callers. Let's look a little closer at the bulk-checkin caller.
If I do this on top:

--- 8< ---
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..892176d23d 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);

+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;

 	while (1) {
--- >8 ---

then test 14 in t1050-large.sh fails because of a segfault in 'git add'.
If we compile with SANITIZE=address, we can see that there's a
use-after-free in hashflush(), which is called by hashfile_checkpoint().

That is a result of the max pack-size code. So we could try something
like:

--- 8< ---
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 7b8a6eb2df..9dc114d132 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint;
+	struct hashfile_checkpoint checkpoint = { 0 };
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -274,17 +274,14 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	the_hash_algo->update_fn(&ctx, obuf, header_len);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0) {
+	if ((flags & HASH_WRITE_OBJECT) != 0)
 		CALLOC_ARRAY(idx, 1);
-
-		prepare_to_stream(state, flags);
-		hashfile_checkpoint_init(state->f, &checkpoint);
-	}
-
 	already_hashed_to = 0;

 	while (1) {
 		prepare_to_stream(state, flags);
+		if (checkpoint.f != state->f)
+			hashfile_checkpoint_init(state->f, &checkpoint);
 		if (idx) {
 			hashfile_checkpoint(&checkpoint);
 			idx->offset = state->offset;
--- >8 ---

which would do the trick, but it feels awfully hacky to have the "if
(checkpoint.f != state->f)" check in there, since that feels too
intimately tied to the implementation of the hashfile_checkpoint API for
my comfort.

It would be nice if we could make the checkpoint only declared within
the loop body itself, but we can't because we need to call
hashfile_truncate() outside of the loop.

Anyway, that's all to say that I think that while this is probably
doable in theory, in practice it's kind of a mess, at least currently.
I would rather see if there are other ways to clean up the
deflate_blob_to_pack() function first in a way that made this change
less awkward.

I think the most reasonable course here would be to pursue a minimal
change like the one presented here and then think about further clean up
as a separate step.

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init()
  2025-01-17 21:30         ` Taylor Blau
@ 2025-01-18 12:15           ` Jeff King
  0 siblings, 0 replies; 60+ messages in thread
From: Jeff King @ 2025-01-18 12:15 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Elijah Newren, Patrick Steinhardt

On Fri, Jan 17, 2025 at 04:30:05PM -0500, Taylor Blau wrote:

> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 433070a3bd..892176d23d 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  	git_hash_ctx ctx;
>  	unsigned char obuf[16384];
>  	unsigned header_len;
> -	struct hashfile_checkpoint checkpoint = {0};
> +	struct hashfile_checkpoint checkpoint;
>  	struct pack_idx_entry *idx = NULL;
> 
>  	seekback = lseek(fd, 0, SEEK_CUR);
> @@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  					  OBJ_BLOB, size);
>  	the_hash_algo->init_fn(&ctx);
>  	the_hash_algo->update_fn(&ctx, obuf, header_len);
> -	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
> 
>  	/* Note: idx is non-NULL when we are writing */
> -	if ((flags & HASH_WRITE_OBJECT) != 0)
> +	if ((flags & HASH_WRITE_OBJECT) != 0) {
>  		CALLOC_ARRAY(idx, 1);
> 
> +		prepare_to_stream(state, flags);
> +		hashfile_checkpoint_init(state->f, &checkpoint);
> +	}
> +
>  	already_hashed_to = 0;
> 
>  	while (1) {

Yeah, that's ugly. We are potentially throwing away the hashfile that
the checkpoint was created for. That makes my instinct to push the
checkpoint down into the loop where we we might restart a new pack, like
this (and like you suggested below):

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..efa59074fb 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -281,8 +280,10 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	already_hashed_to = 0;

 	while (1) {
+		struct hashfile_checkpoint checkpoint = {0};
 		prepare_to_stream(state, flags);
 		if (idx) {
+			hashfile_checkpoint_init(state->f, &checkpoint);
 			hashfile_checkpoint(state->f, &checkpoint);
 			idx->offset = state->offset;
 			crc32_begin(state->f);

but that doesn't work, because the checkpoint is also used later for the
already_written() check:

        if (already_written(state, result_oid)) {
                hashfile_truncate(state->f, &checkpoint);
                state->offset = checkpoint.offset;
                free(idx);
	} else

That made me wonder if there is a bug lurking there. What if we found
the pack was too big, truncated to our checkpoint, and then opened a new
pack? Then the original checkpoint would now be bogus! It would mention
an offset in the original packfile which doesn't make any sense with
what we have open. But I think this is OK, because we can only leave the
loop when stream_blob_to_pack() returns, and we always establish a new
checkpoint before then.

So I do think that moving the initialization of the checkpoint into the
loop, but _not_ moving the variable would work the same way it does now
(i.e., what you suggested below).

But I admit that the way this loop works kind of makes my head spin. It
can really only ever run twice, but it is hard to see: we break out if
stream_blob_to_pack() returns success. And it will only return error if
we would bust the packsize limit (all other errors cause us to die()).
And only if we would bust the limit _and_ we are not the only object in
the pack. And since we start a new pack if we loop, that will never be
true on the second iteration; we'll always either die() or return
success.

I do think it would be much easier to read with a single explicit retry:

  if (checkpoint_and_try_to_stream() < 0) {
	/* we busted the limit; make a new pack and try again */
	hashfile_truncate();
	etc...
        if (checkpoint_and_try_to_stream() < 0)
		BUG("yikes, we should not fail a second time!");
  }

where checkpoint_and_try_to_stream() is the first half of the loop, up
to the stream_blob_to_pack() call.

Anyway, that is all outside of your patch, and relevant only because
_if_ we untangled it a bit more, it might make the checkpoint lifetime a
bit more obvious and less scary to refactor.

But it does imply to me that the data dependency introduced by my
suggestion is not always so straight-forward as I thought it would be,
and we should probably punt on it for your series.

> which would do the trick, but it feels awfully hacky to have the "if
> (checkpoint.f != state->f)" check in there, since that feels too
> intimately tied to the implementation of the hashfile_checkpoint API for
> my comfort.

I think you could unconditionally checkpoint at that part; we're about
to do a write, so we want to store the state before the write in case we
need to roll back.

> Anyway, that's all to say that I think that while this is probably
> doable in theory, in practice it's kind of a mess, at least currently.
> I would rather see if there are other ways to clean up the
> deflate_blob_to_pack() function first in a way that made this change
> less awkward.

Yeah, I actually wrote what I wrote above before reading this far down
in your email, but we arrived at the exact same conclusion. ;) Hopefully
what I wrote might give some pointers if somebody wants to refactor
later.

> I think the most reasonable course here would be to pursue a minimal
> change like the one presented here and then think about further clean up
> as a separate step.

Yep. Thanks for looking into it.

-Peff

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 8/8] hash.h: drop unsafe_ function variants
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                     ` (6 preceding siblings ...)
  2025-01-08 19:14   ` [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
@ 2025-01-08 19:14   ` Taylor Blau
  2025-01-10 10:41   ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Jeff King
  2025-01-11  0:14   ` Junio C Hamano
  9 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-08 19:14 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Now that all callers have been converted from:

    the_hash_algo->unsafe_init_fn();

to

    unsafe_hash_algo(the_hash_algo)->unsafe_init_fn();

and similar, we can remove the scaffolding for the unsafe_ function
variants and force callers to use the new unsafe_hash_algo() mechanic
instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 hash.h        | 15 ---------------
 object-file.c | 15 ---------------
 2 files changed, 30 deletions(-)

diff --git a/hash.h b/hash.h
index 23cf6876e50..6dcbf6ab835 100644
--- a/hash.h
+++ b/hash.h
@@ -282,21 +282,6 @@ struct git_hash_algo {
 	/* The hash finalization function for object IDs. */
 	git_hash_final_oid_fn final_oid_fn;
 
-	/* The non-cryptographic hash initialization function. */
-	git_hash_init_fn unsafe_init_fn;
-
-	/* The non-cryptographic hash context cloning function. */
-	git_hash_clone_fn unsafe_clone_fn;
-
-	/* The non-cryptographic hash update function. */
-	git_hash_update_fn unsafe_update_fn;
-
-	/* The non-cryptographic hash finalization function. */
-	git_hash_final_fn unsafe_final_fn;
-
-	/* The non-cryptographic hash finalization function. */
-	git_hash_final_oid_fn unsafe_final_oid_fn;
-
 	/* The OID of the empty tree. */
 	const struct object_id *empty_tree;
 
diff --git a/object-file.c b/object-file.c
index 43efa4ca361..c4b42dd4be9 100644
--- a/object-file.c
+++ b/object-file.c
@@ -230,11 +230,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_unknown_update,
 		.final_fn = git_hash_unknown_final,
 		.final_oid_fn = git_hash_unknown_final_oid,
-		.unsafe_init_fn = git_hash_unknown_init,
-		.unsafe_clone_fn = git_hash_unknown_clone,
-		.unsafe_update_fn = git_hash_unknown_update,
-		.unsafe_final_fn = git_hash_unknown_final,
-		.unsafe_final_oid_fn = git_hash_unknown_final_oid,
 		.empty_tree = NULL,
 		.empty_blob = NULL,
 		.null_oid = NULL,
@@ -250,11 +245,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_sha1_update,
 		.final_fn = git_hash_sha1_final,
 		.final_oid_fn = git_hash_sha1_final_oid,
-		.unsafe_init_fn = git_hash_sha1_init_unsafe,
-		.unsafe_clone_fn = git_hash_sha1_clone_unsafe,
-		.unsafe_update_fn = git_hash_sha1_update_unsafe,
-		.unsafe_final_fn = git_hash_sha1_final_unsafe,
-		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
 		.unsafe = &sha1_unsafe_algo,
 		.empty_tree = &empty_tree_oid,
 		.empty_blob = &empty_blob_oid,
@@ -271,11 +261,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_sha256_update,
 		.final_fn = git_hash_sha256_final,
 		.final_oid_fn = git_hash_sha256_final_oid,
-		.unsafe_init_fn = git_hash_sha256_init,
-		.unsafe_clone_fn = git_hash_sha256_clone,
-		.unsafe_update_fn = git_hash_sha256_update,
-		.unsafe_final_fn = git_hash_sha256_final,
-		.unsafe_final_oid_fn = git_hash_sha256_final_oid,
 		.empty_tree = &empty_tree_oid_sha256,
 		.empty_blob = &empty_blob_oid_sha256,
 		.null_oid = &null_oid_sha256,
-- 
2.48.0.rc2.33.gaab3d23ed4c

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                     ` (7 preceding siblings ...)
  2025-01-08 19:14   ` [PATCH v2 8/8] hash.h: drop unsafe_ function variants Taylor Blau
@ 2025-01-10 10:41   ` Jeff King
  2025-01-10 21:29     ` Taylor Blau
  2025-01-11  0:14   ` Junio C Hamano
  9 siblings, 1 reply; 60+ messages in thread
From: Jeff King @ 2025-01-10 10:41 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Elijah Newren, Patrick Steinhardt

On Wed, Jan 08, 2025 at 02:14:29PM -0500, Taylor Blau wrote:

> (This series is rebased on 'master', which is 14650065b7
> (RelNotes/2.48.0: fix typos etc., 2025-01-07) at the time of writing).
> 
> The bulk of this series is unchanged since last time, but a new seventh
> patch that further hardens the hashfile_checkpoint callers on top of
> Patrick's recent series[1].

I think that new patch is a definite improvement, though I left some
comments there.

The changes in patch 1 look fine to me (I still think a generic
"test-tool hash" would make more sense, but I'm OK punting on that for
now).

I didn't see any response to the review in round 1 about the pointer
dangers in patch 3. What do you think of using a separate
git_hash_algo_fns struct, with the one-time conversion I showed in the
subthread of:

  https://lore.kernel.org/git/20241121093731.GD602681@coredump.intra.peff.net/

?

-Peff

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2025-01-10 10:41   ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Jeff King
@ 2025-01-10 21:29     ` Taylor Blau
  2025-01-11  2:42       ` Jeff King
  0 siblings, 1 reply; 60+ messages in thread
From: Taylor Blau @ 2025-01-10 21:29 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Elijah Newren, Patrick Steinhardt

On Fri, Jan 10, 2025 at 05:41:06AM -0500, Jeff King wrote:
> On Wed, Jan 08, 2025 at 02:14:29PM -0500, Taylor Blau wrote:
>
> > (This series is rebased on 'master', which is 14650065b7
> > (RelNotes/2.48.0: fix typos etc., 2025-01-07) at the time of writing).
> >
> > The bulk of this series is unchanged since last time, but a new seventh
> > patch that further hardens the hashfile_checkpoint callers on top of
> > Patrick's recent series[1].
>
> I think that new patch is a definite improvement, though I left some
> comments there.
>
> The changes in patch 1 look fine to me (I still think a generic
> "test-tool hash" would make more sense, but I'm OK punting on that for
> now).
>
> I didn't see any response to the review in round 1 about the pointer
> dangers in patch 3. What do you think of using a separate
> git_hash_algo_fns struct, with the one-time conversion I showed in the
> subthread of:
>
>   https://lore.kernel.org/git/20241121093731.GD602681@coredump.intra.peff.net/
>
> ?

Oops. I must have missed those messages; and sure enough when focusing
my inbox on this thread they are indeed unread :-).

That said, I am not sure that that direction is one that I'd want to go
in. Part of the goal of this series is to make it possible to mix safe
and unsafe function calls on the same hash function. So doing something
like:

    struct git_hash_algo *algop;

    algop->init_fn(&ctx);

in one part of the code, and then (using the same algop) calling:

    algop->unsafe_final_fn(...);

should be impossible to do to. The benefit of having only a single set
of functions implemented on the git_hash_algo type is that it is
impossible to mix the two: you'd have to use a different git_hash_algo
altogether!

So porting the above example to your and brian's git_hash_algo_fns
struct, you'd still be able to do:

    algop->fn.init(&ctx);

in one part of the code and algop->unsafe_fn.final(...) in another part,
which doesn't appear to me to be safer than the current situation that
this series aims to solve.

But if I am misunderstanding something about your/brian's discussion
earlier in this thread, please feel free to correct me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2025-01-10 21:29     ` Taylor Blau
@ 2025-01-11  2:42       ` Jeff King
  0 siblings, 0 replies; 60+ messages in thread
From: Jeff King @ 2025-01-11  2:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Elijah Newren, Patrick Steinhardt

On Fri, Jan 10, 2025 at 04:29:38PM -0500, Taylor Blau wrote:

> > I didn't see any response to the review in round 1 about the pointer
> > dangers in patch 3. What do you think of using a separate
> > git_hash_algo_fns struct, with the one-time conversion I showed in the
> > subthread of:
> >
> >   https://lore.kernel.org/git/20241121093731.GD602681@coredump.intra.peff.net/
> >
> > ?
> 
> Oops. I must have missed those messages; and sure enough when focusing
> my inbox on this thread they are indeed unread :-).
> 
> That said, I am not sure that that direction is one that I'd want to go
> in. Part of the goal of this series is to make it possible to mix safe
> and unsafe function calls on the same hash function. So doing something
> like:
> 
>     struct git_hash_algo *algop;
> 
>     algop->init_fn(&ctx);
> 
> in one part of the code, and then (using the same algop) calling:
> 
>     algop->unsafe_final_fn(...);
> 
> should be impossible to do to. The benefit of having only a single set
> of functions implemented on the git_hash_algo type is that it is
> impossible to mix the two: you'd have to use a different git_hash_algo
> altogether!
> 
> So porting the above example to your and brian's git_hash_algo_fns
> struct, you'd still be able to do:
> 
>     algop->fn.init(&ctx);
> 
> in one part of the code and algop->unsafe_fn.final(...) in another part,
> which doesn't appear to me to be safer than the current situation that
> this series aims to solve.

I think what that proposal is doing is orthogonal to the goal of your
series. You'd still have an unsafe_hash_algo() function, but it would
return a git_hash_algo_fns struct, and that's what struct hashfile would
store. So your patches would still remain.

The advantage is mostly that you can't confuse it with a regular
git_hash_algo struct, so it avoids the pointer and hash-id issues.

I do think there is one gotcha, though, which is that the hashfile code
probably still needs the outer algop pointer for things like
algop->raw_size. So you'd have to store both.

It's _possible_ to still confuse the two, but the idea is that you'd
have to explicitly call algop->fn, to get the wrong one there. If we
wanted to make that harder to get wrong, we could start making it a
habit to never use algo->fn directly, but to ask for the safe/unsafe
git_hash_algo_fns struct. But that would be even more churn in the
surrounding code. I think just doing it consistently within hashfile
(which is the only unsafe user) would be sufficient.

-Peff

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                     ` (8 preceding siblings ...)
  2025-01-10 10:41   ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Jeff King
@ 2025-01-11  0:14   ` Junio C Hamano
  2025-01-11 17:14     ` Taylor Blau
  9 siblings, 1 reply; 60+ messages in thread
From: Junio C Hamano @ 2025-01-11  0:14 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

> (This series is rebased on 'master', which is 14650065b7
> (RelNotes/2.48.0: fix typos etc., 2025-01-07) at the time of writing).

The previous round was based on
<cover.1730833506.git.me@ttaylorr.com> which became
'tb/unsafe-hash-test', but this round is based on a recent 'master'
that does not yet contain it?  Does it mean that the 2-patch series
the previous round of this series was based on is no longer needed?

Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2025-01-11  0:14   ` Junio C Hamano
@ 2025-01-11 17:14     ` Taylor Blau
  0 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-11 17:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt

On Fri, Jan 10, 2025 at 04:14:58PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > (This series is rebased on 'master', which is 14650065b7
> > (RelNotes/2.48.0: fix typos etc., 2025-01-07) at the time of writing).
>
> The previous round was based on
> <cover.1730833506.git.me@ttaylorr.com> which became
> 'tb/unsafe-hash-test', but this round is based on a recent 'master'
> that does not yet contain it?  Does it mean that the 2-patch series
> the previous round of this series was based on is no longer needed?

Those two patches got squashed together and became the first patch of
this series, so 'tb/unsafe-hash-test' is safe to discard. Thank you for
shuffling the patches around as always :-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                   ` (6 preceding siblings ...)
  2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
@ 2025-01-17 22:03 ` Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
                     ` (8 more replies)
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
  8 siblings, 9 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 22:03 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

(This series is based on 14650065b7 (RelNotes/2.48.0: fix typos etc.,
2025-01-07)).

The bulk of this series is unchanged since last time, save for a couple
of typofixes on spots noticed by Peff and Patrick Steinhardt. More
importantly, it fixes hash_algo_by_ptr() when passing the unsafe SHA-1
variant.

There were a couple of other ideas floated around, but I don't think
they panned out as we had hoped in practice, so I think that this
version should be good to go.

The original cover letter is as follows:

------------

This series implements an idea discussed in [2] which suggests that we
introduce a way to access a wrapped version of a 'struct git_hash_algo'
which represents the unsafe variant of that algorithm, rather than
having individual unsafe_ functions (like unsafe_init_fn() versus
init_fn(), etc.).

This approach is relatively straightforward to implement, and removes a
significant deficiency in the original implementation of
unsafe/non-cryptographic hash functions by making it impossible to
switch between safe- and unsafe variants of hash functions. It also
cleans up the sha1-unsafe test helper's implementation by removing a
large number of "if (unsafe)"-style conditionals.

The series is laid out as follows:

  * The first two patches prepare the hashfile API for the upcoming
    change:

      csum-file: store the hash algorithm as a struct field
      csum-file.c: extract algop from hashfile_checksum_valid()

  * The next patch implements the new 'unsafe_hash_algo()' function at
    the heart of this series' approach:

      hash.h: introduce `unsafe_hash_algo()`

  * The next two patches convert existing callers to use the new
    'unsafe_hash_algo()' function, instead of switching between safe and
    unsafe_ variants of individual functions:

      csum-file.c: use unsafe_hash_algo()
      t/helper/test-hash.c: use unsafe_hash_algo()

  * The final patch drops the unsafe_ function variants following all
    callers being converted to use the new pattern:

      hash.h: drop unsafe_ function variants

Thanks in advance for your review!

[1]: https://lore.kernel.org/git/20241230-pks-meson-sha1-unsafe-v1-0-efb276e171f5@pks.im/
[2]: https://lore.kernel.org/git/20241107013915.GA961214@coredump.intra.peff.net/

Taylor Blau (8):
  t/helper/test-tool: implement sha1-unsafe helper
  csum-file: store the hash algorithm as a struct field
  csum-file.c: extract algop from hashfile_checksum_valid()
  hash.h: introduce `unsafe_hash_algo()`
  csum-file.c: use unsafe_hash_algo()
  t/helper/test-hash.c: use unsafe_hash_algo()
  csum-file: introduce hashfile_checkpoint_init()
  hash.h: drop unsafe_ function variants

 builtin/fast-import.c  |  2 +-
 bulk-checkin.c         |  9 ++++++---
 csum-file.c            | 40 +++++++++++++++++++++++++---------------
 csum-file.h            |  2 ++
 hash.h                 | 28 ++++++++++++----------------
 object-file.c          | 41 ++++++++++++++++++++++++++---------------
 t/helper/test-hash.c   |  4 +++-
 t/helper/test-sha1.c   |  7 ++++++-
 t/helper/test-sha1.sh  | 38 ++++++++++++++++++++++----------------
 t/helper/test-sha256.c |  2 +-
 t/helper/test-tool.c   |  1 +
 t/helper/test-tool.h   |  3 ++-
 12 files changed, 107 insertions(+), 70 deletions(-)

Range-diff against v2:
1:  4c1523a04f1 = 1:  ae6b8c75294 t/helper/test-tool: implement sha1-unsafe helper
2:  99cc44895b5 ! 2:  2b79c76e471 csum-file: store the hash algorithm as a struct field
    @@ Commit message
         csum-file: store the hash algorithm as a struct field
     
         Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
    -    and call its _usnafe function variants directly.
    +    and call its _unsafe function variants directly.
     
         Prepare for a future change where we may use a different 'git_hash_algo'
         pointer (instead of just relying on 'the_hash_algo' throughout) by
3:  1ffab2f8289 = 3:  d7deb3f338e csum-file.c: extract algop from hashfile_checksum_valid()
4:  99dcbe2e716 ! 4:  b6b2bb2714f hash.h: introduce `unsafe_hash_algo()`
    @@ Commit message
             if (unsafe)
               algop = unsafe_hash_algo(algop);
     
    -        the_hash_algo->init_fn(...);
    -        the_hash_algo->update_fn(...);
    -        the_hash_algo->final_fn(...);
    +        algop->init_fn(...);
    +        algop->update_fn(...);
    +        algop->final_fn(...);
     
         This removes the existing shortcoming by no longer forcing the caller to
         "remember" which variant of the hash functions it wants to call, only to
    @@ Commit message
         functions, this too will go away after subsequent commits remove all
         direct calls to the unsafe_ variants.
     
    +    Note that hash_algo_by_ptr() needs an adjustment to allow passing in the
    +    unsafe variant of a hash function. All other query functions on the
    +    hash_algos array will continue to return the safe variants of any
    +    function.
    +
         Suggested-by: Jeff King <peff@peff.net>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ hash.h: struct git_hash_algo {
      };
      extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
      
    -@@ hash.h: static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
    - 	return p - hash_algos;
    +@@ hash.h: int hash_algo_by_length(int len);
    + /* Identical, except for a pointer to struct git_hash_algo. */
    + static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
    + {
    +-	return p - hash_algos;
    ++	size_t i;
    ++	for (i = 0; i < GIT_HASH_NALGOS; i++) {
    ++		const struct git_hash_algo *algop = &hash_algos[i];
    ++		if (p == algop || (algop->unsafe && p == algop->unsafe))
    ++			return i;
    ++	}
    ++	return GIT_HASH_UNKNOWN;
      }
      
     +const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop);
5:  2dcc2aa6803 = 5:  ca67de80971 csum-file.c: use unsafe_hash_algo()
6:  a2b9ef03080 = 6:  21b175b07ff t/helper/test-hash.c: use unsafe_hash_algo()
7:  94c07fd8a55 = 7:  850d4f407db csum-file: introduce hashfile_checkpoint_init()
8:  f5579883816 ! 8:  0c4d006e6e8 hash.h: drop unsafe_ function variants
    @@ Commit message
     
         to
     
    -        unsafe_hash_algo(the_hash_algo)->unsafe_init_fn();
    +        unsafe_hash_algo(the_hash_algo)->init_fn();
     
         and similar, we can remove the scaffolding for the unsafe_ function
         variants and force callers to use the new unsafe_hash_algo() mechanic

base-commit: 14650065b76b28d3cfa9453356ac5669b19e706e
-- 
2.48.0.rc2.35.g0c4d006e6e8

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v3 1/8] t/helper/test-tool: implement sha1-unsafe helper
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
@ 2025-01-17 22:03   ` Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 22:03 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

With the new "unsafe" SHA-1 build knob, it is convenient to have a
test-tool that can exercise Git's unsafe SHA-1 wrappers for testing,
similar to 't/helper/test-tool sha1'.

Implement that helper by altering the implementation of that test-tool
(in cmd_hash_impl(), which is generic and parameterized over different
hash functions) to conditionally run the unsafe variants of the chosen
hash function, and expose the new behavior via a new 'sha1-unsafe' test
helper.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hash.c   | 17 +++++++++++++----
 t/helper/test-sha1.c   |  7 ++++++-
 t/helper/test-sha1.sh  | 38 ++++++++++++++++++++++----------------
 t/helper/test-sha256.c |  2 +-
 t/helper/test-tool.c   |  1 +
 t/helper/test-tool.h   |  3 ++-
 6 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/t/helper/test-hash.c b/t/helper/test-hash.c
index 45d829c908f..d0ee668df95 100644
--- a/t/helper/test-hash.c
+++ b/t/helper/test-hash.c
@@ -1,7 +1,7 @@
 #include "test-tool.h"
 #include "hex.h"
 
-int cmd_hash_impl(int ac, const char **av, int algo)
+int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 {
 	git_hash_ctx ctx;
 	unsigned char hash[GIT_MAX_HEXSZ];
@@ -27,7 +27,10 @@ int cmd_hash_impl(int ac, const char **av, int algo)
 			die("OOPS");
 	}
 
-	algop->init_fn(&ctx);
+	if (unsafe)
+		algop->unsafe_init_fn(&ctx);
+	else
+		algop->init_fn(&ctx);
 
 	while (1) {
 		ssize_t sz, this_sz;
@@ -46,9 +49,15 @@ int cmd_hash_impl(int ac, const char **av, int algo)
 		}
 		if (this_sz == 0)
 			break;
-		algop->update_fn(&ctx, buffer, this_sz);
+		if (unsafe)
+			algop->unsafe_update_fn(&ctx, buffer, this_sz);
+		else
+			algop->update_fn(&ctx, buffer, this_sz);
 	}
-	algop->final_fn(hash, &ctx);
+	if (unsafe)
+		algop->unsafe_final_fn(hash, &ctx);
+	else
+		algop->final_fn(hash, &ctx);
 
 	if (binary)
 		fwrite(hash, 1, algop->rawsz, stdout);
diff --git a/t/helper/test-sha1.c b/t/helper/test-sha1.c
index e60d000c039..349540c4df8 100644
--- a/t/helper/test-sha1.c
+++ b/t/helper/test-sha1.c
@@ -3,7 +3,7 @@
 
 int cmd__sha1(int ac, const char **av)
 {
-	return cmd_hash_impl(ac, av, GIT_HASH_SHA1);
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA1, 0);
 }
 
 int cmd__sha1_is_sha1dc(int argc UNUSED, const char **argv UNUSED)
@@ -13,3 +13,8 @@ int cmd__sha1_is_sha1dc(int argc UNUSED, const char **argv UNUSED)
 #endif
 	return 1;
 }
+
+int cmd__sha1_unsafe(int ac, const char **av)
+{
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA1, 1);
+}
diff --git a/t/helper/test-sha1.sh b/t/helper/test-sha1.sh
index 84594885c70..bf387d3db14 100755
--- a/t/helper/test-sha1.sh
+++ b/t/helper/test-sha1.sh
@@ -3,25 +3,31 @@
 dd if=/dev/zero bs=1048576 count=100 2>/dev/null |
 /usr/bin/time t/helper/test-tool sha1 >/dev/null
 
+dd if=/dev/zero bs=1048576 count=100 2>/dev/null |
+/usr/bin/time t/helper/test-tool sha1-unsafe >/dev/null
+
 while read expect cnt pfx
 do
 	case "$expect" in '#'*) continue ;; esac
-	actual=$(
-		{
-			test -z "$pfx" || echo "$pfx"
-			dd if=/dev/zero bs=1048576 count=$cnt 2>/dev/null |
-			perl -pe 'y/\000/g/'
-		} | ./t/helper/test-tool sha1 $cnt
-	)
-	if test "$expect" = "$actual"
-	then
-		echo "OK: $expect $cnt $pfx"
-	else
-		echo >&2 "OOPS: $cnt"
-		echo >&2 "expect: $expect"
-		echo >&2 "actual: $actual"
-		exit 1
-	fi
+	for sha1 in sha1 sha1-unsafe
+	do
+		actual=$(
+			{
+				test -z "$pfx" || echo "$pfx"
+				dd if=/dev/zero bs=1048576 count=$cnt 2>/dev/null |
+				perl -pe 'y/\000/g/'
+			} | ./t/helper/test-tool $sha1 $cnt
+		)
+		if test "$expect" = "$actual"
+		then
+			echo "OK ($sha1): $expect $cnt $pfx"
+		else
+			echo >&2 "OOPS ($sha1): $cnt"
+			echo >&2 "expect ($sha1): $expect"
+			echo >&2 "actual ($sha1): $actual"
+			exit 1
+		fi
+	done
 done <<EOF
 da39a3ee5e6b4b0d3255bfef95601890afd80709 0
 3f786850e387550fdab836ed7e6dc881de23001b 0 a
diff --git a/t/helper/test-sha256.c b/t/helper/test-sha256.c
index 2fb20438f3c..7fd0aa1fcd3 100644
--- a/t/helper/test-sha256.c
+++ b/t/helper/test-sha256.c
@@ -3,5 +3,5 @@
 
 int cmd__sha256(int ac, const char **av)
 {
-	return cmd_hash_impl(ac, av, GIT_HASH_SHA256);
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA256, 0);
 }
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 4a7aa993ba9..958452ef12e 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -70,6 +70,7 @@ static struct test_cmd cmds[] = {
 	{ "serve-v2", cmd__serve_v2 },
 	{ "sha1", cmd__sha1 },
 	{ "sha1-is-sha1dc", cmd__sha1_is_sha1dc },
+	{ "sha1-unsafe", cmd__sha1_unsafe },
 	{ "sha256", cmd__sha256 },
 	{ "sigchain", cmd__sigchain },
 	{ "simple-ipc", cmd__simple_ipc },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..24149edd414 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -63,6 +63,7 @@ int cmd__scrap_cache_tree(int argc, const char **argv);
 int cmd__serve_v2(int argc, const char **argv);
 int cmd__sha1(int argc, const char **argv);
 int cmd__sha1_is_sha1dc(int argc, const char **argv);
+int cmd__sha1_unsafe(int argc, const char **argv);
 int cmd__sha256(int argc, const char **argv);
 int cmd__sigchain(int argc, const char **argv);
 int cmd__simple_ipc(int argc, const char **argv);
@@ -81,6 +82,6 @@ int cmd__windows_named_pipe(int argc, const char **argv);
 #endif
 int cmd__write_cache(int argc, const char **argv);
 
-int cmd_hash_impl(int ac, const char **av, int algo);
+int cmd_hash_impl(int ac, const char **av, int algo, int unsafe);
 
 #endif
-- 
2.48.0.rc2.35.g0c4d006e6e8


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v3 2/8] csum-file: store the hash algorithm as a struct field
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
@ 2025-01-17 22:03   ` Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 22:03 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
and call its _unsafe function variants directly.

Prepare for a future change where we may use a different 'git_hash_algo'
pointer (instead of just relying on 'the_hash_algo' throughout) by
making the 'git_hash_algo' pointer a member of the 'hashfile' structure
itself.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 20 +++++++++++---------
 csum-file.h |  1 +
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 5716016e12e..b28cd047e3f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -50,7 +50,7 @@ void hashflush(struct hashfile *f)
 
 	if (offset) {
 		if (!f->skip_hash)
-			the_hash_algo->unsafe_update_fn(&f->ctx, f->buffer, offset);
+			f->algop->unsafe_update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -71,14 +71,14 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	hashflush(f);
 
 	if (f->skip_hash)
-		hashclr(f->buffer, the_repository->hash_algo);
+		hashclr(f->buffer, f->algop);
 	else
-		the_hash_algo->unsafe_final_fn(f->buffer, &f->ctx);
+		f->algop->unsafe_final_fn(f->buffer, &f->ctx);
 
 	if (result)
-		hashcpy(result, f->buffer, the_repository->hash_algo);
+		hashcpy(result, f->buffer, f->algop);
 	if (flags & CSUM_HASH_IN_STREAM)
-		flush(f, f->buffer, the_hash_algo->rawsz);
+		flush(f, f->buffer, f->algop->rawsz);
 	if (flags & CSUM_FSYNC)
 		fsync_component_or_die(component, f->fd, f->name);
 	if (flags & CSUM_CLOSE) {
@@ -128,7 +128,7 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * f->offset is necessarily zero.
 			 */
 			if (!f->skip_hash)
-				the_hash_algo->unsafe_update_fn(&f->ctx, buf, nr);
+				f->algop->unsafe_update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -174,7 +174,9 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->name = name;
 	f->do_crc = 0;
 	f->skip_hash = 0;
-	the_hash_algo->unsafe_init_fn(&f->ctx);
+
+	f->algop = the_hash_algo;
+	f->algop->unsafe_init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
 	f->buffer = xmalloc(buffer_len);
@@ -208,7 +210,7 @@ void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpo
 {
 	hashflush(f);
 	checkpoint->offset = f->total;
-	the_hash_algo->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
+	f->algop->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
 }
 
 int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
@@ -219,7 +221,7 @@ int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint
 	    lseek(f->fd, offset, SEEK_SET) != offset)
 		return -1;
 	f->total = offset;
-	the_hash_algo->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
+	f->algop->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
 	f->offset = 0; /* hashflush() was called in checkpoint */
 	return 0;
 }
diff --git a/csum-file.h b/csum-file.h
index 7c73da0a40a..2b45f4673a2 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,7 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+	const struct git_hash_algo *algop;
 
 	/**
 	 * If non-zero, skip_hash indicates that we should
-- 
2.48.0.rc2.35.g0c4d006e6e8


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v3 3/8] csum-file.c: extract algop from hashfile_checksum_valid()
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
@ 2025-01-17 22:03   ` Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 22:03 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Perform a similar transformation as in the previous commit, but focused
instead on hashfile_checksum_valid(). This function does not work with a
hashfile structure itself, and instead validates the raw contents of a
file written using the hashfile API.

We'll want to be prepared for a similar change to this function in the
future, so prepare ourselves for that by extracting 'the_hash_algo' into
its own field for use within this function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index b28cd047e3f..7a71121e340 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -242,14 +242,15 @@ int hashfile_checksum_valid(const unsigned char *data, size_t total_len)
 {
 	unsigned char got[GIT_MAX_RAWSZ];
 	git_hash_ctx ctx;
-	size_t data_len = total_len - the_hash_algo->rawsz;
+	const struct git_hash_algo *algop = the_hash_algo;
+	size_t data_len = total_len - algop->rawsz;
 
-	if (total_len < the_hash_algo->rawsz)
+	if (total_len < algop->rawsz)
 		return 0; /* say "too short"? */
 
-	the_hash_algo->unsafe_init_fn(&ctx);
-	the_hash_algo->unsafe_update_fn(&ctx, data, data_len);
-	the_hash_algo->unsafe_final_fn(got, &ctx);
+	algop->unsafe_init_fn(&ctx);
+	algop->unsafe_update_fn(&ctx, data, data_len);
+	algop->unsafe_final_fn(got, &ctx);
 
-	return hasheq(got, data + data_len, the_repository->hash_algo);
+	return hasheq(got, data + data_len, algop);
 }
-- 
2.48.0.rc2.35.g0c4d006e6e8


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v3 4/8] hash.h: introduce `unsafe_hash_algo()`
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
                     ` (2 preceding siblings ...)
  2025-01-17 22:03   ` [PATCH v3 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
@ 2025-01-17 22:03   ` Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 22:03 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

In 253ed9ecff (hash.h: scaffolding for _unsafe hashing variants,
2024-09-26), we introduced "unsafe" variants of the SHA-1 hashing
functions by introducing new functions like "unsafe_init_fn()" and so
on.

This approach has a major shortcoming that callers must remember to
consistently use one variant or the other. Failing to consistently use
(or not use) the unsafe variants can lead to crashes at best, or subtle
memory corruption issues at worst.

In the hashfile API, this isn't difficult to achieve, but verifying that
all callers consistently use the unsafe variants is somewhat of a chore
given how spread out all of the callers are. In the sha1 and sha1-unsafe
test helpers, all of the calls to various hash functions are guarded by
an "if (unsafe)" conditional, which is repetitive and cumbersome.

Address these issues by introducing a new pattern whereby one
'git_hash_algo' can return a pointer to another 'git_hash_algo' that
represents the unsafe version of itself. So instead of having something
like:

    if (unsafe)
      the_hash_algo->init_fn(...);
      the_hash_algo->update_fn(...);
      the_hash_algo->final_fn(...);
    else
      the_hash_algo->unsafe_init_fn(...);
      the_hash_algo->unsafe_update_fn(...);
      the_hash_algo->unsafe_final_fn(...);

we can instead write:

    struct git_hash_algo *algop = the_hash_algo;
    if (unsafe)
      algop = unsafe_hash_algo(algop);

    algop->init_fn(...);
    algop->update_fn(...);
    algop->final_fn(...);

This removes the existing shortcoming by no longer forcing the caller to
"remember" which variant of the hash functions it wants to call, only to
hold onto a 'struct git_hash_algo' pointer that is initialized once.

Similarly, while there currently is still a way to "mix" safe and unsafe
functions, this too will go away after subsequent commits remove all
direct calls to the unsafe_ variants.

Note that hash_algo_by_ptr() needs an adjustment to allow passing in the
unsafe variant of a hash function. All other query functions on the
hash_algos array will continue to return the safe variants of any
function.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 hash.h        | 13 ++++++++++++-
 object-file.c | 26 ++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/hash.h b/hash.h
index 756166ce5e8..a68a2b6a161 100644
--- a/hash.h
+++ b/hash.h
@@ -305,6 +305,9 @@ struct git_hash_algo {
 
 	/* The all-zeros OID. */
 	const struct object_id *null_oid;
+
+	/* The unsafe variant of this hash function, if one exists. */
+	const struct git_hash_algo *unsafe;
 };
 extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
 
@@ -320,9 +323,17 @@ int hash_algo_by_length(int len);
 /* Identical, except for a pointer to struct git_hash_algo. */
 static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
 {
-	return p - hash_algos;
+	size_t i;
+	for (i = 0; i < GIT_HASH_NALGOS; i++) {
+		const struct git_hash_algo *algop = &hash_algos[i];
+		if (p == algop || (algop->unsafe && p == algop->unsafe))
+			return i;
+	}
+	return GIT_HASH_UNKNOWN;
 }
 
+const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop);
+
 const struct object_id *null_oid(void);
 
 static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2, const struct git_hash_algo *algop)
diff --git a/object-file.c b/object-file.c
index 5b792b3dd42..43efa4ca361 100644
--- a/object-file.c
+++ b/object-file.c
@@ -202,6 +202,22 @@ static void git_hash_unknown_final_oid(struct object_id *oid UNUSED,
 	BUG("trying to finalize unknown hash");
 }
 
+static const struct git_hash_algo sha1_unsafe_algo = {
+	.name = "sha1",
+	.format_id = GIT_SHA1_FORMAT_ID,
+	.rawsz = GIT_SHA1_RAWSZ,
+	.hexsz = GIT_SHA1_HEXSZ,
+	.blksz = GIT_SHA1_BLKSZ,
+	.init_fn = git_hash_sha1_init_unsafe,
+	.clone_fn = git_hash_sha1_clone_unsafe,
+	.update_fn = git_hash_sha1_update_unsafe,
+	.final_fn = git_hash_sha1_final_unsafe,
+	.final_oid_fn = git_hash_sha1_final_oid_unsafe,
+	.empty_tree = &empty_tree_oid,
+	.empty_blob = &empty_blob_oid,
+	.null_oid = &null_oid_sha1,
+};
+
 const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 	{
 		.name = NULL,
@@ -239,6 +255,7 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.unsafe_update_fn = git_hash_sha1_update_unsafe,
 		.unsafe_final_fn = git_hash_sha1_final_unsafe,
 		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
+		.unsafe = &sha1_unsafe_algo,
 		.empty_tree = &empty_tree_oid,
 		.empty_blob = &empty_blob_oid,
 		.null_oid = &null_oid_sha1,
@@ -305,6 +322,15 @@ int hash_algo_by_length(int len)
 	return GIT_HASH_UNKNOWN;
 }
 
+const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop)
+{
+	/* If we have a faster "unsafe" implementation, use that. */
+	if (algop->unsafe)
+		return algop->unsafe;
+	/* Otherwise use the default one. */
+	return algop;
+}
+
 /*
  * This is meant to hold a *small* number of objects that you would
  * want repo_read_object_file() to be able to return, but yet you do not want
-- 
2.48.0.rc2.35.g0c4d006e6e8


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v3 5/8] csum-file.c: use unsafe_hash_algo()
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
                     ` (3 preceding siblings ...)
  2025-01-17 22:03   ` [PATCH v3 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
@ 2025-01-17 22:03   ` Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 6/8] t/helper/test-hash.c: " Taylor Blau
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 22:03 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Instead of calling the unsafe_ hash function variants directly, make use
of the shared 'algop' pointer by initializing it to:

    f->algop = unsafe_hash_algo(the_hash_algo);

, thus making all calls use the unsafe variants directly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 7a71121e340..ebffc80ef71 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -50,7 +50,7 @@ void hashflush(struct hashfile *f)
 
 	if (offset) {
 		if (!f->skip_hash)
-			f->algop->unsafe_update_fn(&f->ctx, f->buffer, offset);
+			f->algop->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -73,7 +73,7 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	if (f->skip_hash)
 		hashclr(f->buffer, f->algop);
 	else
-		f->algop->unsafe_final_fn(f->buffer, &f->ctx);
+		f->algop->final_fn(f->buffer, &f->ctx);
 
 	if (result)
 		hashcpy(result, f->buffer, f->algop);
@@ -128,7 +128,7 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * f->offset is necessarily zero.
 			 */
 			if (!f->skip_hash)
-				f->algop->unsafe_update_fn(&f->ctx, buf, nr);
+				f->algop->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -175,8 +175,8 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->do_crc = 0;
 	f->skip_hash = 0;
 
-	f->algop = the_hash_algo;
-	f->algop->unsafe_init_fn(&f->ctx);
+	f->algop = unsafe_hash_algo(the_hash_algo);
+	f->algop->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
 	f->buffer = xmalloc(buffer_len);
@@ -210,7 +210,7 @@ void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpo
 {
 	hashflush(f);
 	checkpoint->offset = f->total;
-	f->algop->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
+	f->algop->clone_fn(&checkpoint->ctx, &f->ctx);
 }
 
 int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
@@ -221,7 +221,7 @@ int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint
 	    lseek(f->fd, offset, SEEK_SET) != offset)
 		return -1;
 	f->total = offset;
-	f->algop->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
+	f->algop->clone_fn(&f->ctx, &checkpoint->ctx);
 	f->offset = 0; /* hashflush() was called in checkpoint */
 	return 0;
 }
@@ -242,15 +242,15 @@ int hashfile_checksum_valid(const unsigned char *data, size_t total_len)
 {
 	unsigned char got[GIT_MAX_RAWSZ];
 	git_hash_ctx ctx;
-	const struct git_hash_algo *algop = the_hash_algo;
+	const struct git_hash_algo *algop = unsafe_hash_algo(the_hash_algo);
 	size_t data_len = total_len - algop->rawsz;
 
 	if (total_len < algop->rawsz)
 		return 0; /* say "too short"? */
 
-	algop->unsafe_init_fn(&ctx);
-	algop->unsafe_update_fn(&ctx, data, data_len);
-	algop->unsafe_final_fn(got, &ctx);
+	algop->init_fn(&ctx);
+	algop->update_fn(&ctx, data, data_len);
+	algop->final_fn(got, &ctx);
 
 	return hasheq(got, data + data_len, algop);
 }
-- 
2.48.0.rc2.35.g0c4d006e6e8


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v3 6/8] t/helper/test-hash.c: use unsafe_hash_algo()
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
                     ` (4 preceding siblings ...)
  2025-01-17 22:03   ` [PATCH v3 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
@ 2025-01-17 22:03   ` Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 22:03 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Remove a series of conditionals within the shared cmd_hash_impl() helper
that powers the 'sha1' and 'sha1-unsafe' helpers.

Instead, replace them with a single conditional that transforms the
specified hash algorithm into its unsafe variant. Then all subsequent
calls can directly use whatever function it wants to call without having
to decide between the safe and unsafe variants.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hash.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/t/helper/test-hash.c b/t/helper/test-hash.c
index d0ee668df95..aa82638c621 100644
--- a/t/helper/test-hash.c
+++ b/t/helper/test-hash.c
@@ -9,6 +9,8 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 	int binary = 0;
 	char *buffer;
 	const struct git_hash_algo *algop = &hash_algos[algo];
+	if (unsafe)
+		algop = unsafe_hash_algo(algop);
 
 	if (ac == 2) {
 		if (!strcmp(av[1], "-b"))
@@ -27,10 +29,7 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 			die("OOPS");
 	}
 
-	if (unsafe)
-		algop->unsafe_init_fn(&ctx);
-	else
-		algop->init_fn(&ctx);
+	algop->init_fn(&ctx);
 
 	while (1) {
 		ssize_t sz, this_sz;
@@ -49,15 +48,9 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 		}
 		if (this_sz == 0)
 			break;
-		if (unsafe)
-			algop->unsafe_update_fn(&ctx, buffer, this_sz);
-		else
-			algop->update_fn(&ctx, buffer, this_sz);
+		algop->update_fn(&ctx, buffer, this_sz);
 	}
-	if (unsafe)
-		algop->unsafe_final_fn(hash, &ctx);
-	else
-		algop->final_fn(hash, &ctx);
+	algop->final_fn(hash, &ctx);
 
 	if (binary)
 		fwrite(hash, 1, algop->rawsz, stdout);
-- 
2.48.0.rc2.35.g0c4d006e6e8


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v3 7/8] csum-file: introduce hashfile_checkpoint_init()
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
                     ` (5 preceding siblings ...)
  2025-01-17 22:03   ` [PATCH v3 6/8] t/helper/test-hash.c: " Taylor Blau
@ 2025-01-17 22:03   ` Taylor Blau
  2025-01-17 22:03   ` [PATCH v3 8/8] hash.h: drop unsafe_ function variants Taylor Blau
  2025-01-18 12:28   ` [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Jeff King
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 22:03 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
initialize a hashfile_checkpoint with the same hash function
implementation as is used by the hashfile it is used to checkpoint.

While both 106140a99f and 9218c0bfe1 work around the immediate crash,
changing the hash function implementation within the hashfile API to,
for example, the non-unsafe variant would re-introduce the crash. This
is a result of the tight coupling between initializing hashfiles and
hashfile_checkpoints.

Introduce and use a new function which ensures that both parts of a
hashfile and hashfile_checkpoint pair use the same hash function
implementation to avoid such crashes.

A few things worth noting:

  - In the change to builtin/fast-import.c::stream_blob(), we can see
    that by removing the explicit reference to
    'the_hash_algo->unsafe_init_fn()', we are hardened against the
    hashfile API changing away from the_hash_algo (or its unsafe
    variant) in the future.

  - The bulk-checkin code no longer needs to explicitly zero-initialize
    the hashfile_checkpoint, since it is now done as a result of calling
    'hashfile_checkpoint_init()'.

  - Also in the bulk-checkin code, we add an additional call to
    prepare_to_stream() outside of the main loop in order to initialize
    'state->f' so we know which hash function implementation to use when
    calling 'hashfile_checkpoint_init()'.

    This is OK, since subsequent 'prepare_to_stream()' calls are noops.
    However, we only need to call 'prepare_to_stream()' when we have the
    HASH_WRITE_OBJECT bit set in our flags. Without that bit, calling
    'prepare_to_stream()' does not assign 'state->f', so we have nothing
    to initialize.

  - Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
    appropriately guarded.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/fast-import.c | 2 +-
 bulk-checkin.c        | 9 ++++++---
 csum-file.c           | 7 +++++++
 csum-file.h           | 1 +
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 0f86392761a..4a6c7ab52ac 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -1106,7 +1106,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 		|| (pack_size + PACK_SIZE_THRESHOLD + len) < pack_size)
 		cycle_packfile();
 
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
+	hashfile_checkpoint_init(pack_file, &checkpoint);
 	hashfile_checkpoint(pack_file, &checkpoint);
 	offset = checkpoint.offset;
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bda..892176d23d2 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
 
 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);
 
+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;
 
 	while (1) {
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef71..232121f415f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,13 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }
 
+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+	f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a2..b7475f16c20 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -36,6 +36,7 @@ struct hashfile_checkpoint {
 	git_hash_ctx ctx;
 };
 
+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 
-- 
2.48.0.rc2.35.g0c4d006e6e8


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v3 8/8] hash.h: drop unsafe_ function variants
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
                     ` (6 preceding siblings ...)
  2025-01-17 22:03   ` [PATCH v3 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
@ 2025-01-17 22:03   ` Taylor Blau
  2025-01-18 12:28   ` [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Jeff King
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-17 22:03 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Now that all callers have been converted from:

    the_hash_algo->unsafe_init_fn();

to

    unsafe_hash_algo(the_hash_algo)->init_fn();

and similar, we can remove the scaffolding for the unsafe_ function
variants and force callers to use the new unsafe_hash_algo() mechanic
instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 hash.h        | 15 ---------------
 object-file.c | 15 ---------------
 2 files changed, 30 deletions(-)

diff --git a/hash.h b/hash.h
index a68a2b6a161..68d4292e6da 100644
--- a/hash.h
+++ b/hash.h
@@ -282,21 +282,6 @@ struct git_hash_algo {
 	/* The hash finalization function for object IDs. */
 	git_hash_final_oid_fn final_oid_fn;
 
-	/* The non-cryptographic hash initialization function. */
-	git_hash_init_fn unsafe_init_fn;
-
-	/* The non-cryptographic hash context cloning function. */
-	git_hash_clone_fn unsafe_clone_fn;
-
-	/* The non-cryptographic hash update function. */
-	git_hash_update_fn unsafe_update_fn;
-
-	/* The non-cryptographic hash finalization function. */
-	git_hash_final_fn unsafe_final_fn;
-
-	/* The non-cryptographic hash finalization function. */
-	git_hash_final_oid_fn unsafe_final_oid_fn;
-
 	/* The OID of the empty tree. */
 	const struct object_id *empty_tree;
 
diff --git a/object-file.c b/object-file.c
index 43efa4ca361..c4b42dd4be9 100644
--- a/object-file.c
+++ b/object-file.c
@@ -230,11 +230,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_unknown_update,
 		.final_fn = git_hash_unknown_final,
 		.final_oid_fn = git_hash_unknown_final_oid,
-		.unsafe_init_fn = git_hash_unknown_init,
-		.unsafe_clone_fn = git_hash_unknown_clone,
-		.unsafe_update_fn = git_hash_unknown_update,
-		.unsafe_final_fn = git_hash_unknown_final,
-		.unsafe_final_oid_fn = git_hash_unknown_final_oid,
 		.empty_tree = NULL,
 		.empty_blob = NULL,
 		.null_oid = NULL,
@@ -250,11 +245,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_sha1_update,
 		.final_fn = git_hash_sha1_final,
 		.final_oid_fn = git_hash_sha1_final_oid,
-		.unsafe_init_fn = git_hash_sha1_init_unsafe,
-		.unsafe_clone_fn = git_hash_sha1_clone_unsafe,
-		.unsafe_update_fn = git_hash_sha1_update_unsafe,
-		.unsafe_final_fn = git_hash_sha1_final_unsafe,
-		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
 		.unsafe = &sha1_unsafe_algo,
 		.empty_tree = &empty_tree_oid,
 		.empty_blob = &empty_blob_oid,
@@ -271,11 +261,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_sha256_update,
 		.final_fn = git_hash_sha256_final,
 		.final_oid_fn = git_hash_sha256_final_oid,
-		.unsafe_init_fn = git_hash_sha256_init,
-		.unsafe_clone_fn = git_hash_sha256_clone,
-		.unsafe_update_fn = git_hash_sha256_update,
-		.unsafe_final_fn = git_hash_sha256_final,
-		.unsafe_final_oid_fn = git_hash_sha256_final_oid,
 		.empty_tree = &empty_tree_oid_sha256,
 		.empty_blob = &empty_blob_oid_sha256,
 		.null_oid = &null_oid_sha256,
-- 
2.48.0.rc2.35.g0c4d006e6e8

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
                     ` (7 preceding siblings ...)
  2025-01-17 22:03   ` [PATCH v3 8/8] hash.h: drop unsafe_ function variants Taylor Blau
@ 2025-01-18 12:28   ` Jeff King
  2025-01-18 12:43     ` Jeff King
  8 siblings, 1 reply; 60+ messages in thread
From: Jeff King @ 2025-01-18 12:28 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Elijah Newren, Patrick Steinhardt

On Fri, Jan 17, 2025 at 05:03:07PM -0500, Taylor Blau wrote:

>     +    Note that hash_algo_by_ptr() needs an adjustment to allow passing in the
>     +    unsafe variant of a hash function. All other query functions on the
>     +    hash_algos array will continue to return the safe variants of any
>     +    function.
>     +
>          Suggested-by: Jeff King <peff@peff.net>
>          Signed-off-by: Taylor Blau <me@ttaylorr.com>
>      
>     @@ hash.h: struct git_hash_algo {
>       };
>       extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
>       
>     -@@ hash.h: static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>     - 	return p - hash_algos;
>     +@@ hash.h: int hash_algo_by_length(int len);
>     + /* Identical, except for a pointer to struct git_hash_algo. */
>     + static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>     + {
>     +-	return p - hash_algos;
>     ++	size_t i;
>     ++	for (i = 0; i < GIT_HASH_NALGOS; i++) {
>     ++		const struct git_hash_algo *algop = &hash_algos[i];
>     ++		if (p == algop || (algop->unsafe && p == algop->unsafe))
>     ++			return i;
>     ++	}
>     ++	return GIT_HASH_UNKNOWN;
>       }

OK, so this version introduces the loop we discussed earlier. I think we
can probably dismiss any performance loss as theoretical unless somebody
can think of a good way to measure. It seems like worrying about it is
probably a premature micro-optimization.

It is a little quirky that it loses the transitive nature of
hash_algo_by_ptr() and hash_algo_by_id(). So this is unsafe:

  /* start with unsafe variant */
  algo = unsafe_hash_algo(the_hash_algo);
  algo->init_fn(...);

  /* returns GIT_HASH_SHA1, even for the unsafe variant */
  id = hash_algo_by_ptr(algo);

  /* returns the safe variant */
  algo = hash_algo_by_id(id);

  /* oops, this is mix-and-matching */
  alog->update_fn(...);

Now obviously that sort of round-trip is a silly thing to do in a single
function. Could it happen across a larger call-chain, where the id is
stored in a struct and passed somewhere far away? I guess so, but it
also seems kind of unlikely.

It does make me wonder if hash_algo_by_ptr() should just return UNKNOWN
for the unsafe variant. Then we'd at least get a loud error from the
caller (as opposed to the code before your patch, which is undefined
behavior). I dunno.

-Peff

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2025-01-18 12:28   ` [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Jeff King
@ 2025-01-18 12:43     ` Jeff King
  2025-01-22 21:31       ` Junio C Hamano
  0 siblings, 1 reply; 60+ messages in thread
From: Jeff King @ 2025-01-18 12:43 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Elijah Newren, Patrick Steinhardt

On Sat, Jan 18, 2025 at 07:28:31AM -0500, Jeff King wrote:

> It does make me wonder if hash_algo_by_ptr() should just return UNKNOWN
> for the unsafe variant. Then we'd at least get a loud error from the
> caller (as opposed to the code before your patch, which is undefined
> behavior). I dunno.

Doing this:

diff --git a/hash.h b/hash.h
index 68d4292e6d..ad2c919991 100644
--- a/hash.h
+++ b/hash.h
@@ -311,7 +311,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
 	size_t i;
 	for (i = 0; i < GIT_HASH_NALGOS; i++) {
 		const struct git_hash_algo *algop = &hash_algos[i];
-		if (p == algop || (algop->unsafe && p == algop->unsafe))
+		if (p == algop)
 			return i;
 	}
 	return GIT_HASH_UNKNOWN;

on top of your series doesn't trigger any issues in the test suite.
Which I think shows that we would never have triggered the UB from the
original implementation, either[1]. Still, I think I prefer your loop to
being one errant function call away from undefined behavior. But maybe
returning UNKNOWN (as above) is a safer bet than losing the inverse
nature of the by_ptr() and by_id() functions, which could otherwise
cause hard-to-find interactions?

-Peff

[1] One thing that still puzzles me. If you further add a BUG() to that
    function when we'd return UNKNOWN, you can see that we trigger a few
    cases in the test suite where we pass in NULL (e.g., because we're
    not in a repo). I think these are already UB with the existing
    "p - hash_algos" implementation! We'll return what is effectively
    "p" cast to an int.

    E.g. if I do this:

    diff --git a/hash.h b/hash.h
    index 756166ce5e..ff31156416 100644
    --- a/hash.h
    +++ b/hash.h
    @@ -320,6 +320,8 @@ int hash_algo_by_length(int len);
     /* Identical, except for a pointer to struct git_hash_algo. */
     static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
     {
    +	if (!p)
    +		BUG("null hash, return is %d", (int)(p - hash_algos));
     	return p - hash_algos;
     }

     then t1517 shows grep dying with:

       BUG: hash.h:324: null hash, return is -1646754892

     I think it "works" because the backtrace is:

      [...]
      #5  0x0000556d436f6b71 in BUG_fl (file=0x556d43790e8b "hash.h", line=324,
          fmt=0x556d43790e73 "null hash, return is %d") at usage.c:335
      #6  0x0000556d4357c2f9 in hash_algo_by_ptr (p=0x0) at /home/peff/compile/git/hash.h:324
      #7  0x0000556d4357c437 in oidclr (oid=0x7ffdf5810ea4, algop=0x0) at /home/peff/compile/git/hash.h:392
      #8  0x0000556d435801c7 in prep_exclude (dir=0x7ffdf5811190, istate=0x556d608959c0, base=0x556d6089da50 "nums",
          baselen=0) at dir.c:1699
      [...]
      #16 0x0000556d4344dd4a in cmd_grep (argc=0, argv=0x556d60895ee8, prefix=0x0, repo=0x0) at builtin/grep.c:1257

     So we probably write a totally garbage algo field into the
     object_id, but nobody ever ends up looking at it. Probably
     something we should clean up, but way out of scope for your series.
     But I do think it reinforces that returning UNKNOWN is an
     improvement; it avoids undefined behavior and anybody who tried to
     use it should get a BUG() from calling git_hash_unknown_*()
     functions.


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2025-01-18 12:43     ` Jeff King
@ 2025-01-22 21:31       ` Junio C Hamano
  0 siblings, 0 replies; 60+ messages in thread
From: Junio C Hamano @ 2025-01-22 21:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, Elijah Newren, Patrick Steinhardt

Jeff King <peff@peff.net> writes:

> .... But maybe
> returning UNKNOWN (as above) is a safer bet than losing the inverse
> nature of the by_ptr() and by_id() functions, which could otherwise
> cause hard-to-find interactions?

Yeah, that sounds sensible.

>      I think it "works" because the backtrace is:
>
>       [...]
>       #5  0x0000556d436f6b71 in BUG_fl (file=0x556d43790e8b "hash.h", line=324,
>           fmt=0x556d43790e73 "null hash, return is %d") at usage.c:335
>       #6  0x0000556d4357c2f9 in hash_algo_by_ptr (p=0x0) at /home/peff/compile/git/hash.h:324
>       #7  0x0000556d4357c437 in oidclr (oid=0x7ffdf5810ea4, algop=0x0) at /home/peff/compile/git/hash.h:392
>       #8  0x0000556d435801c7 in prep_exclude (dir=0x7ffdf5811190, istate=0x556d608959c0, base=0x556d6089da50 "nums",
>           baselen=0) at dir.c:1699
>       [...]
>       #16 0x0000556d4344dd4a in cmd_grep (argc=0, argv=0x556d60895ee8, prefix=0x0, repo=0x0) at builtin/grep.c:1257
>
>      So we probably write a totally garbage algo field into the
>      object_id, but nobody ever ends up looking at it. Probably
>      something we should clean up, but way out of scope for your series.

Yeah, that looks bad, but I think it is fine to leave it outside the
series to fix that.

>      But I do think it reinforces that returning UNKNOWN is an
>      improvement; it avoids undefined behavior and anybody who tried to
>      use it should get a BUG() from calling git_hash_unknown_*()
>      functions.

Sounds like a sensitive direction to go in.

Thanks.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v4 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
                   ` (7 preceding siblings ...)
  2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
@ 2025-01-23 17:34 ` Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
                     ` (8 more replies)
  8 siblings, 9 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-23 17:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

(This series is based on 14650065b7 (RelNotes/2.48.0: fix typos etc.,
2025-01-07)).

Here is a hopefully final version of my series to harden the unsafe hash
algorithm changes added in v2.47.0. The only difference from last time
is that hash_algo_by_ptr() now returns GIT_HASH_UNKNOWN for NULL and
unsafe variants, which is a strict improvement from both v3 of this
series and the status-quo on master.

As usual, a (small) range-diff is included below for convenience, and
the original cover letter is as follows:

------------

This series implements an idea discussed in [2] which suggests that we
introduce a way to access a wrapped version of a 'struct git_hash_algo'
which represents the unsafe variant of that algorithm, rather than
having individual unsafe_ functions (like unsafe_init_fn() versus
init_fn(), etc.).

This approach is relatively straightforward to implement, and removes a
significant deficiency in the original implementation of
unsafe/non-cryptographic hash functions by making it impossible to
switch between safe- and unsafe variants of hash functions. It also
cleans up the sha1-unsafe test helper's implementation by removing a
large number of "if (unsafe)"-style conditionals.

The series is laid out as follows:

  * The first two patches prepare the hashfile API for the upcoming
    change:

      csum-file: store the hash algorithm as a struct field
      csum-file.c: extract algop from hashfile_checksum_valid()

  * The next patch implements the new 'unsafe_hash_algo()' function at
    the heart of this series' approach:

      hash.h: introduce `unsafe_hash_algo()`

  * The next two patches convert existing callers to use the new
    'unsafe_hash_algo()' function, instead of switching between safe and
    unsafe_ variants of individual functions:

      csum-file.c: use unsafe_hash_algo()
      t/helper/test-hash.c: use unsafe_hash_algo()

  * The final patch drops the unsafe_ function variants following all
    callers being converted to use the new pattern:

      hash.h: drop unsafe_ function variants

Thanks in advance for your review!

[1]: https://lore.kernel.org/git/20241230-pks-meson-sha1-unsafe-v1-0-efb276e171f5@pks.im/
[2]: https://lore.kernel.org/git/20241107013915.GA961214@coredump.intra.peff.net/

Taylor Blau (8):
  t/helper/test-tool: implement sha1-unsafe helper
  csum-file: store the hash algorithm as a struct field
  csum-file.c: extract algop from hashfile_checksum_valid()
  hash.h: introduce `unsafe_hash_algo()`
  csum-file.c: use unsafe_hash_algo()
  t/helper/test-hash.c: use unsafe_hash_algo()
  csum-file: introduce hashfile_checkpoint_init()
  hash.h: drop unsafe_ function variants

 builtin/fast-import.c  |  2 +-
 bulk-checkin.c         |  9 ++++++---
 csum-file.c            | 40 +++++++++++++++++++++++++---------------
 csum-file.h            |  2 ++
 hash.h                 | 28 ++++++++++++----------------
 object-file.c          | 41 ++++++++++++++++++++++++++---------------
 t/helper/test-hash.c   |  4 +++-
 t/helper/test-sha1.c   |  7 ++++++-
 t/helper/test-sha1.sh  | 38 ++++++++++++++++++++++----------------
 t/helper/test-sha256.c |  2 +-
 t/helper/test-tool.c   |  1 +
 t/helper/test-tool.h   |  3 ++-
 12 files changed, 107 insertions(+), 70 deletions(-)

Range-diff against v3:
1:  ae6b8c75294 = 1:  b64ae238248 t/helper/test-tool: implement sha1-unsafe helper
2:  2b79c76e471 = 2:  d03f503682f csum-file: store the hash algorithm as a struct field
3:  d7deb3f338e = 3:  73554c3b881 csum-file.c: extract algop from hashfile_checksum_valid()
4:  b6b2bb2714f ! 4:  ae01f1f4274 hash.h: introduce `unsafe_hash_algo()`
    @@ hash.h: int hash_algo_by_length(int len);
     +	size_t i;
     +	for (i = 0; i < GIT_HASH_NALGOS; i++) {
     +		const struct git_hash_algo *algop = &hash_algos[i];
    -+		if (p == algop || (algop->unsafe && p == algop->unsafe))
    ++		if (p == algop)
     +			return i;
     +	}
     +	return GIT_HASH_UNKNOWN;
5:  ca67de80971 = 5:  64a850c77ae csum-file.c: use unsafe_hash_algo()
6:  21b175b07ff = 6:  3dcccccf752 t/helper/test-hash.c: use unsafe_hash_algo()
7:  850d4f407db = 7:  da97157c4a1 csum-file: introduce hashfile_checkpoint_init()
8:  0c4d006e6e8 = 8:  0ba27182b5e hash.h: drop unsafe_ function variants

base-commit: 14650065b76b28d3cfa9453356ac5669b19e706e
-- 
2.48.0.rc2.35.gd215225db14

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v4 1/8] t/helper/test-tool: implement sha1-unsafe helper
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
@ 2025-01-23 17:34   ` Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-23 17:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

With the new "unsafe" SHA-1 build knob, it is convenient to have a
test-tool that can exercise Git's unsafe SHA-1 wrappers for testing,
similar to 't/helper/test-tool sha1'.

Implement that helper by altering the implementation of that test-tool
(in cmd_hash_impl(), which is generic and parameterized over different
hash functions) to conditionally run the unsafe variants of the chosen
hash function, and expose the new behavior via a new 'sha1-unsafe' test
helper.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hash.c   | 17 +++++++++++++----
 t/helper/test-sha1.c   |  7 ++++++-
 t/helper/test-sha1.sh  | 38 ++++++++++++++++++++++----------------
 t/helper/test-sha256.c |  2 +-
 t/helper/test-tool.c   |  1 +
 t/helper/test-tool.h   |  3 ++-
 6 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/t/helper/test-hash.c b/t/helper/test-hash.c
index 45d829c908f..d0ee668df95 100644
--- a/t/helper/test-hash.c
+++ b/t/helper/test-hash.c
@@ -1,7 +1,7 @@
 #include "test-tool.h"
 #include "hex.h"
 
-int cmd_hash_impl(int ac, const char **av, int algo)
+int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 {
 	git_hash_ctx ctx;
 	unsigned char hash[GIT_MAX_HEXSZ];
@@ -27,7 +27,10 @@ int cmd_hash_impl(int ac, const char **av, int algo)
 			die("OOPS");
 	}
 
-	algop->init_fn(&ctx);
+	if (unsafe)
+		algop->unsafe_init_fn(&ctx);
+	else
+		algop->init_fn(&ctx);
 
 	while (1) {
 		ssize_t sz, this_sz;
@@ -46,9 +49,15 @@ int cmd_hash_impl(int ac, const char **av, int algo)
 		}
 		if (this_sz == 0)
 			break;
-		algop->update_fn(&ctx, buffer, this_sz);
+		if (unsafe)
+			algop->unsafe_update_fn(&ctx, buffer, this_sz);
+		else
+			algop->update_fn(&ctx, buffer, this_sz);
 	}
-	algop->final_fn(hash, &ctx);
+	if (unsafe)
+		algop->unsafe_final_fn(hash, &ctx);
+	else
+		algop->final_fn(hash, &ctx);
 
 	if (binary)
 		fwrite(hash, 1, algop->rawsz, stdout);
diff --git a/t/helper/test-sha1.c b/t/helper/test-sha1.c
index e60d000c039..349540c4df8 100644
--- a/t/helper/test-sha1.c
+++ b/t/helper/test-sha1.c
@@ -3,7 +3,7 @@
 
 int cmd__sha1(int ac, const char **av)
 {
-	return cmd_hash_impl(ac, av, GIT_HASH_SHA1);
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA1, 0);
 }
 
 int cmd__sha1_is_sha1dc(int argc UNUSED, const char **argv UNUSED)
@@ -13,3 +13,8 @@ int cmd__sha1_is_sha1dc(int argc UNUSED, const char **argv UNUSED)
 #endif
 	return 1;
 }
+
+int cmd__sha1_unsafe(int ac, const char **av)
+{
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA1, 1);
+}
diff --git a/t/helper/test-sha1.sh b/t/helper/test-sha1.sh
index 84594885c70..bf387d3db14 100755
--- a/t/helper/test-sha1.sh
+++ b/t/helper/test-sha1.sh
@@ -3,25 +3,31 @@
 dd if=/dev/zero bs=1048576 count=100 2>/dev/null |
 /usr/bin/time t/helper/test-tool sha1 >/dev/null
 
+dd if=/dev/zero bs=1048576 count=100 2>/dev/null |
+/usr/bin/time t/helper/test-tool sha1-unsafe >/dev/null
+
 while read expect cnt pfx
 do
 	case "$expect" in '#'*) continue ;; esac
-	actual=$(
-		{
-			test -z "$pfx" || echo "$pfx"
-			dd if=/dev/zero bs=1048576 count=$cnt 2>/dev/null |
-			perl -pe 'y/\000/g/'
-		} | ./t/helper/test-tool sha1 $cnt
-	)
-	if test "$expect" = "$actual"
-	then
-		echo "OK: $expect $cnt $pfx"
-	else
-		echo >&2 "OOPS: $cnt"
-		echo >&2 "expect: $expect"
-		echo >&2 "actual: $actual"
-		exit 1
-	fi
+	for sha1 in sha1 sha1-unsafe
+	do
+		actual=$(
+			{
+				test -z "$pfx" || echo "$pfx"
+				dd if=/dev/zero bs=1048576 count=$cnt 2>/dev/null |
+				perl -pe 'y/\000/g/'
+			} | ./t/helper/test-tool $sha1 $cnt
+		)
+		if test "$expect" = "$actual"
+		then
+			echo "OK ($sha1): $expect $cnt $pfx"
+		else
+			echo >&2 "OOPS ($sha1): $cnt"
+			echo >&2 "expect ($sha1): $expect"
+			echo >&2 "actual ($sha1): $actual"
+			exit 1
+		fi
+	done
 done <<EOF
 da39a3ee5e6b4b0d3255bfef95601890afd80709 0
 3f786850e387550fdab836ed7e6dc881de23001b 0 a
diff --git a/t/helper/test-sha256.c b/t/helper/test-sha256.c
index 2fb20438f3c..7fd0aa1fcd3 100644
--- a/t/helper/test-sha256.c
+++ b/t/helper/test-sha256.c
@@ -3,5 +3,5 @@
 
 int cmd__sha256(int ac, const char **av)
 {
-	return cmd_hash_impl(ac, av, GIT_HASH_SHA256);
+	return cmd_hash_impl(ac, av, GIT_HASH_SHA256, 0);
 }
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 4a7aa993ba9..958452ef12e 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -70,6 +70,7 @@ static struct test_cmd cmds[] = {
 	{ "serve-v2", cmd__serve_v2 },
 	{ "sha1", cmd__sha1 },
 	{ "sha1-is-sha1dc", cmd__sha1_is_sha1dc },
+	{ "sha1-unsafe", cmd__sha1_unsafe },
 	{ "sha256", cmd__sha256 },
 	{ "sigchain", cmd__sigchain },
 	{ "simple-ipc", cmd__simple_ipc },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..24149edd414 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -63,6 +63,7 @@ int cmd__scrap_cache_tree(int argc, const char **argv);
 int cmd__serve_v2(int argc, const char **argv);
 int cmd__sha1(int argc, const char **argv);
 int cmd__sha1_is_sha1dc(int argc, const char **argv);
+int cmd__sha1_unsafe(int argc, const char **argv);
 int cmd__sha256(int argc, const char **argv);
 int cmd__sigchain(int argc, const char **argv);
 int cmd__simple_ipc(int argc, const char **argv);
@@ -81,6 +82,6 @@ int cmd__windows_named_pipe(int argc, const char **argv);
 #endif
 int cmd__write_cache(int argc, const char **argv);
 
-int cmd_hash_impl(int ac, const char **av, int algo);
+int cmd_hash_impl(int ac, const char **av, int algo, int unsafe);
 
 #endif
-- 
2.48.0.rc2.35.gd215225db14


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 2/8] csum-file: store the hash algorithm as a struct field
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
@ 2025-01-23 17:34   ` Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-23 17:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Throughout the hashfile API, we rely on a reference to 'the_hash_algo',
and call its _unsafe function variants directly.

Prepare for a future change where we may use a different 'git_hash_algo'
pointer (instead of just relying on 'the_hash_algo' throughout) by
making the 'git_hash_algo' pointer a member of the 'hashfile' structure
itself.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 20 +++++++++++---------
 csum-file.h |  1 +
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 5716016e12e..b28cd047e3f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -50,7 +50,7 @@ void hashflush(struct hashfile *f)
 
 	if (offset) {
 		if (!f->skip_hash)
-			the_hash_algo->unsafe_update_fn(&f->ctx, f->buffer, offset);
+			f->algop->unsafe_update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -71,14 +71,14 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	hashflush(f);
 
 	if (f->skip_hash)
-		hashclr(f->buffer, the_repository->hash_algo);
+		hashclr(f->buffer, f->algop);
 	else
-		the_hash_algo->unsafe_final_fn(f->buffer, &f->ctx);
+		f->algop->unsafe_final_fn(f->buffer, &f->ctx);
 
 	if (result)
-		hashcpy(result, f->buffer, the_repository->hash_algo);
+		hashcpy(result, f->buffer, f->algop);
 	if (flags & CSUM_HASH_IN_STREAM)
-		flush(f, f->buffer, the_hash_algo->rawsz);
+		flush(f, f->buffer, f->algop->rawsz);
 	if (flags & CSUM_FSYNC)
 		fsync_component_or_die(component, f->fd, f->name);
 	if (flags & CSUM_CLOSE) {
@@ -128,7 +128,7 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * f->offset is necessarily zero.
 			 */
 			if (!f->skip_hash)
-				the_hash_algo->unsafe_update_fn(&f->ctx, buf, nr);
+				f->algop->unsafe_update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -174,7 +174,9 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->name = name;
 	f->do_crc = 0;
 	f->skip_hash = 0;
-	the_hash_algo->unsafe_init_fn(&f->ctx);
+
+	f->algop = the_hash_algo;
+	f->algop->unsafe_init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
 	f->buffer = xmalloc(buffer_len);
@@ -208,7 +210,7 @@ void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpo
 {
 	hashflush(f);
 	checkpoint->offset = f->total;
-	the_hash_algo->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
+	f->algop->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
 }
 
 int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
@@ -219,7 +221,7 @@ int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint
 	    lseek(f->fd, offset, SEEK_SET) != offset)
 		return -1;
 	f->total = offset;
-	the_hash_algo->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
+	f->algop->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
 	f->offset = 0; /* hashflush() was called in checkpoint */
 	return 0;
 }
diff --git a/csum-file.h b/csum-file.h
index 7c73da0a40a..2b45f4673a2 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,7 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+	const struct git_hash_algo *algop;
 
 	/**
 	 * If non-zero, skip_hash indicates that we should
-- 
2.48.0.rc2.35.gd215225db14


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 3/8] csum-file.c: extract algop from hashfile_checksum_valid()
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
@ 2025-01-23 17:34   ` Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-23 17:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Perform a similar transformation as in the previous commit, but focused
instead on hashfile_checksum_valid(). This function does not work with a
hashfile structure itself, and instead validates the raw contents of a
file written using the hashfile API.

We'll want to be prepared for a similar change to this function in the
future, so prepare ourselves for that by extracting 'the_hash_algo' into
its own field for use within this function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index b28cd047e3f..7a71121e340 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -242,14 +242,15 @@ int hashfile_checksum_valid(const unsigned char *data, size_t total_len)
 {
 	unsigned char got[GIT_MAX_RAWSZ];
 	git_hash_ctx ctx;
-	size_t data_len = total_len - the_hash_algo->rawsz;
+	const struct git_hash_algo *algop = the_hash_algo;
+	size_t data_len = total_len - algop->rawsz;
 
-	if (total_len < the_hash_algo->rawsz)
+	if (total_len < algop->rawsz)
 		return 0; /* say "too short"? */
 
-	the_hash_algo->unsafe_init_fn(&ctx);
-	the_hash_algo->unsafe_update_fn(&ctx, data, data_len);
-	the_hash_algo->unsafe_final_fn(got, &ctx);
+	algop->unsafe_init_fn(&ctx);
+	algop->unsafe_update_fn(&ctx, data, data_len);
+	algop->unsafe_final_fn(got, &ctx);
 
-	return hasheq(got, data + data_len, the_repository->hash_algo);
+	return hasheq(got, data + data_len, algop);
 }
-- 
2.48.0.rc2.35.gd215225db14


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 4/8] hash.h: introduce `unsafe_hash_algo()`
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
                     ` (2 preceding siblings ...)
  2025-01-23 17:34   ` [PATCH v4 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
@ 2025-01-23 17:34   ` Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-23 17:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

In 253ed9ecff (hash.h: scaffolding for _unsafe hashing variants,
2024-09-26), we introduced "unsafe" variants of the SHA-1 hashing
functions by introducing new functions like "unsafe_init_fn()" and so
on.

This approach has a major shortcoming that callers must remember to
consistently use one variant or the other. Failing to consistently use
(or not use) the unsafe variants can lead to crashes at best, or subtle
memory corruption issues at worst.

In the hashfile API, this isn't difficult to achieve, but verifying that
all callers consistently use the unsafe variants is somewhat of a chore
given how spread out all of the callers are. In the sha1 and sha1-unsafe
test helpers, all of the calls to various hash functions are guarded by
an "if (unsafe)" conditional, which is repetitive and cumbersome.

Address these issues by introducing a new pattern whereby one
'git_hash_algo' can return a pointer to another 'git_hash_algo' that
represents the unsafe version of itself. So instead of having something
like:

    if (unsafe)
      the_hash_algo->init_fn(...);
      the_hash_algo->update_fn(...);
      the_hash_algo->final_fn(...);
    else
      the_hash_algo->unsafe_init_fn(...);
      the_hash_algo->unsafe_update_fn(...);
      the_hash_algo->unsafe_final_fn(...);

we can instead write:

    struct git_hash_algo *algop = the_hash_algo;
    if (unsafe)
      algop = unsafe_hash_algo(algop);

    algop->init_fn(...);
    algop->update_fn(...);
    algop->final_fn(...);

This removes the existing shortcoming by no longer forcing the caller to
"remember" which variant of the hash functions it wants to call, only to
hold onto a 'struct git_hash_algo' pointer that is initialized once.

Similarly, while there currently is still a way to "mix" safe and unsafe
functions, this too will go away after subsequent commits remove all
direct calls to the unsafe_ variants.

Note that hash_algo_by_ptr() needs an adjustment to allow passing in the
unsafe variant of a hash function. All other query functions on the
hash_algos array will continue to return the safe variants of any
function.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 hash.h        | 13 ++++++++++++-
 object-file.c | 26 ++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/hash.h b/hash.h
index 756166ce5e8..0bf63cedfa4 100644
--- a/hash.h
+++ b/hash.h
@@ -305,6 +305,9 @@ struct git_hash_algo {
 
 	/* The all-zeros OID. */
 	const struct object_id *null_oid;
+
+	/* The unsafe variant of this hash function, if one exists. */
+	const struct git_hash_algo *unsafe;
 };
 extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
 
@@ -320,9 +323,17 @@ int hash_algo_by_length(int len);
 /* Identical, except for a pointer to struct git_hash_algo. */
 static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
 {
-	return p - hash_algos;
+	size_t i;
+	for (i = 0; i < GIT_HASH_NALGOS; i++) {
+		const struct git_hash_algo *algop = &hash_algos[i];
+		if (p == algop)
+			return i;
+	}
+	return GIT_HASH_UNKNOWN;
 }
 
+const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop);
+
 const struct object_id *null_oid(void);
 
 static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2, const struct git_hash_algo *algop)
diff --git a/object-file.c b/object-file.c
index 5b792b3dd42..43efa4ca361 100644
--- a/object-file.c
+++ b/object-file.c
@@ -202,6 +202,22 @@ static void git_hash_unknown_final_oid(struct object_id *oid UNUSED,
 	BUG("trying to finalize unknown hash");
 }
 
+static const struct git_hash_algo sha1_unsafe_algo = {
+	.name = "sha1",
+	.format_id = GIT_SHA1_FORMAT_ID,
+	.rawsz = GIT_SHA1_RAWSZ,
+	.hexsz = GIT_SHA1_HEXSZ,
+	.blksz = GIT_SHA1_BLKSZ,
+	.init_fn = git_hash_sha1_init_unsafe,
+	.clone_fn = git_hash_sha1_clone_unsafe,
+	.update_fn = git_hash_sha1_update_unsafe,
+	.final_fn = git_hash_sha1_final_unsafe,
+	.final_oid_fn = git_hash_sha1_final_oid_unsafe,
+	.empty_tree = &empty_tree_oid,
+	.empty_blob = &empty_blob_oid,
+	.null_oid = &null_oid_sha1,
+};
+
 const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 	{
 		.name = NULL,
@@ -239,6 +255,7 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.unsafe_update_fn = git_hash_sha1_update_unsafe,
 		.unsafe_final_fn = git_hash_sha1_final_unsafe,
 		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
+		.unsafe = &sha1_unsafe_algo,
 		.empty_tree = &empty_tree_oid,
 		.empty_blob = &empty_blob_oid,
 		.null_oid = &null_oid_sha1,
@@ -305,6 +322,15 @@ int hash_algo_by_length(int len)
 	return GIT_HASH_UNKNOWN;
 }
 
+const struct git_hash_algo *unsafe_hash_algo(const struct git_hash_algo *algop)
+{
+	/* If we have a faster "unsafe" implementation, use that. */
+	if (algop->unsafe)
+		return algop->unsafe;
+	/* Otherwise use the default one. */
+	return algop;
+}
+
 /*
  * This is meant to hold a *small* number of objects that you would
  * want repo_read_object_file() to be able to return, but yet you do not want
-- 
2.48.0.rc2.35.gd215225db14


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 5/8] csum-file.c: use unsafe_hash_algo()
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
                     ` (3 preceding siblings ...)
  2025-01-23 17:34   ` [PATCH v4 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
@ 2025-01-23 17:34   ` Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 6/8] t/helper/test-hash.c: " Taylor Blau
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-23 17:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Instead of calling the unsafe_ hash function variants directly, make use
of the shared 'algop' pointer by initializing it to:

    f->algop = unsafe_hash_algo(the_hash_algo);

, thus making all calls use the unsafe variants directly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 csum-file.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 7a71121e340..ebffc80ef71 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -50,7 +50,7 @@ void hashflush(struct hashfile *f)
 
 	if (offset) {
 		if (!f->skip_hash)
-			f->algop->unsafe_update_fn(&f->ctx, f->buffer, offset);
+			f->algop->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -73,7 +73,7 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	if (f->skip_hash)
 		hashclr(f->buffer, f->algop);
 	else
-		f->algop->unsafe_final_fn(f->buffer, &f->ctx);
+		f->algop->final_fn(f->buffer, &f->ctx);
 
 	if (result)
 		hashcpy(result, f->buffer, f->algop);
@@ -128,7 +128,7 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * f->offset is necessarily zero.
 			 */
 			if (!f->skip_hash)
-				f->algop->unsafe_update_fn(&f->ctx, buf, nr);
+				f->algop->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -175,8 +175,8 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->do_crc = 0;
 	f->skip_hash = 0;
 
-	f->algop = the_hash_algo;
-	f->algop->unsafe_init_fn(&f->ctx);
+	f->algop = unsafe_hash_algo(the_hash_algo);
+	f->algop->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
 	f->buffer = xmalloc(buffer_len);
@@ -210,7 +210,7 @@ void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpo
 {
 	hashflush(f);
 	checkpoint->offset = f->total;
-	f->algop->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
+	f->algop->clone_fn(&checkpoint->ctx, &f->ctx);
 }
 
 int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
@@ -221,7 +221,7 @@ int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint
 	    lseek(f->fd, offset, SEEK_SET) != offset)
 		return -1;
 	f->total = offset;
-	f->algop->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
+	f->algop->clone_fn(&f->ctx, &checkpoint->ctx);
 	f->offset = 0; /* hashflush() was called in checkpoint */
 	return 0;
 }
@@ -242,15 +242,15 @@ int hashfile_checksum_valid(const unsigned char *data, size_t total_len)
 {
 	unsigned char got[GIT_MAX_RAWSZ];
 	git_hash_ctx ctx;
-	const struct git_hash_algo *algop = the_hash_algo;
+	const struct git_hash_algo *algop = unsafe_hash_algo(the_hash_algo);
 	size_t data_len = total_len - algop->rawsz;
 
 	if (total_len < algop->rawsz)
 		return 0; /* say "too short"? */
 
-	algop->unsafe_init_fn(&ctx);
-	algop->unsafe_update_fn(&ctx, data, data_len);
-	algop->unsafe_final_fn(got, &ctx);
+	algop->init_fn(&ctx);
+	algop->update_fn(&ctx, data, data_len);
+	algop->final_fn(got, &ctx);
 
 	return hasheq(got, data + data_len, algop);
 }
-- 
2.48.0.rc2.35.gd215225db14


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 6/8] t/helper/test-hash.c: use unsafe_hash_algo()
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
                     ` (4 preceding siblings ...)
  2025-01-23 17:34   ` [PATCH v4 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
@ 2025-01-23 17:34   ` Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-23 17:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Remove a series of conditionals within the shared cmd_hash_impl() helper
that powers the 'sha1' and 'sha1-unsafe' helpers.

Instead, replace them with a single conditional that transforms the
specified hash algorithm into its unsafe variant. Then all subsequent
calls can directly use whatever function it wants to call without having
to decide between the safe and unsafe variants.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-hash.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/t/helper/test-hash.c b/t/helper/test-hash.c
index d0ee668df95..aa82638c621 100644
--- a/t/helper/test-hash.c
+++ b/t/helper/test-hash.c
@@ -9,6 +9,8 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 	int binary = 0;
 	char *buffer;
 	const struct git_hash_algo *algop = &hash_algos[algo];
+	if (unsafe)
+		algop = unsafe_hash_algo(algop);
 
 	if (ac == 2) {
 		if (!strcmp(av[1], "-b"))
@@ -27,10 +29,7 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 			die("OOPS");
 	}
 
-	if (unsafe)
-		algop->unsafe_init_fn(&ctx);
-	else
-		algop->init_fn(&ctx);
+	algop->init_fn(&ctx);
 
 	while (1) {
 		ssize_t sz, this_sz;
@@ -49,15 +48,9 @@ int cmd_hash_impl(int ac, const char **av, int algo, int unsafe)
 		}
 		if (this_sz == 0)
 			break;
-		if (unsafe)
-			algop->unsafe_update_fn(&ctx, buffer, this_sz);
-		else
-			algop->update_fn(&ctx, buffer, this_sz);
+		algop->update_fn(&ctx, buffer, this_sz);
 	}
-	if (unsafe)
-		algop->unsafe_final_fn(hash, &ctx);
-	else
-		algop->final_fn(hash, &ctx);
+	algop->final_fn(hash, &ctx);
 
 	if (binary)
 		fwrite(hash, 1, algop->rawsz, stdout);
-- 
2.48.0.rc2.35.gd215225db14


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 7/8] csum-file: introduce hashfile_checkpoint_init()
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
                     ` (5 preceding siblings ...)
  2025-01-23 17:34   ` [PATCH v4 6/8] t/helper/test-hash.c: " Taylor Blau
@ 2025-01-23 17:34   ` Taylor Blau
  2025-01-23 17:34   ` [PATCH v4 8/8] hash.h: drop unsafe_ function variants Taylor Blau
  2025-01-23 18:30   ` [PATCH v4 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Junio C Hamano
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-23 17:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
initialize a hashfile_checkpoint with the same hash function
implementation as is used by the hashfile it is used to checkpoint.

While both 106140a99f and 9218c0bfe1 work around the immediate crash,
changing the hash function implementation within the hashfile API to,
for example, the non-unsafe variant would re-introduce the crash. This
is a result of the tight coupling between initializing hashfiles and
hashfile_checkpoints.

Introduce and use a new function which ensures that both parts of a
hashfile and hashfile_checkpoint pair use the same hash function
implementation to avoid such crashes.

A few things worth noting:

  - In the change to builtin/fast-import.c::stream_blob(), we can see
    that by removing the explicit reference to
    'the_hash_algo->unsafe_init_fn()', we are hardened against the
    hashfile API changing away from the_hash_algo (or its unsafe
    variant) in the future.

  - The bulk-checkin code no longer needs to explicitly zero-initialize
    the hashfile_checkpoint, since it is now done as a result of calling
    'hashfile_checkpoint_init()'.

  - Also in the bulk-checkin code, we add an additional call to
    prepare_to_stream() outside of the main loop in order to initialize
    'state->f' so we know which hash function implementation to use when
    calling 'hashfile_checkpoint_init()'.

    This is OK, since subsequent 'prepare_to_stream()' calls are noops.
    However, we only need to call 'prepare_to_stream()' when we have the
    HASH_WRITE_OBJECT bit set in our flags. Without that bit, calling
    'prepare_to_stream()' does not assign 'state->f', so we have nothing
    to initialize.

  - Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
    appropriately guarded.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/fast-import.c | 2 +-
 bulk-checkin.c        | 9 ++++++---
 csum-file.c           | 7 +++++++
 csum-file.h           | 1 +
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 0f86392761a..4a6c7ab52ac 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -1106,7 +1106,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 		|| (pack_size + PACK_SIZE_THRESHOLD + len) < pack_size)
 		cycle_packfile();
 
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
+	hashfile_checkpoint_init(pack_file, &checkpoint);
 	hashfile_checkpoint(pack_file, &checkpoint);
 	offset = checkpoint.offset;
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bda..892176d23d2 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
 
 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);
 
+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;
 
 	while (1) {
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef71..232121f415f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,13 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }
 
+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+	f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a2..b7475f16c20 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -36,6 +36,7 @@ struct hashfile_checkpoint {
 	git_hash_ctx ctx;
 };
 
+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 
-- 
2.48.0.rc2.35.gd215225db14


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 8/8] hash.h: drop unsafe_ function variants
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
                     ` (6 preceding siblings ...)
  2025-01-23 17:34   ` [PATCH v4 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
@ 2025-01-23 17:34   ` Taylor Blau
  2025-01-23 18:30   ` [PATCH v4 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Junio C Hamano
  8 siblings, 0 replies; 60+ messages in thread
From: Taylor Blau @ 2025-01-23 17:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Elijah Newren, Patrick Steinhardt

Now that all callers have been converted from:

    the_hash_algo->unsafe_init_fn();

to

    unsafe_hash_algo(the_hash_algo)->init_fn();

and similar, we can remove the scaffolding for the unsafe_ function
variants and force callers to use the new unsafe_hash_algo() mechanic
instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 hash.h        | 15 ---------------
 object-file.c | 15 ---------------
 2 files changed, 30 deletions(-)

diff --git a/hash.h b/hash.h
index 0bf63cedfa4..ad2c919991c 100644
--- a/hash.h
+++ b/hash.h
@@ -282,21 +282,6 @@ struct git_hash_algo {
 	/* The hash finalization function for object IDs. */
 	git_hash_final_oid_fn final_oid_fn;
 
-	/* The non-cryptographic hash initialization function. */
-	git_hash_init_fn unsafe_init_fn;
-
-	/* The non-cryptographic hash context cloning function. */
-	git_hash_clone_fn unsafe_clone_fn;
-
-	/* The non-cryptographic hash update function. */
-	git_hash_update_fn unsafe_update_fn;
-
-	/* The non-cryptographic hash finalization function. */
-	git_hash_final_fn unsafe_final_fn;
-
-	/* The non-cryptographic hash finalization function. */
-	git_hash_final_oid_fn unsafe_final_oid_fn;
-
 	/* The OID of the empty tree. */
 	const struct object_id *empty_tree;
 
diff --git a/object-file.c b/object-file.c
index 43efa4ca361..c4b42dd4be9 100644
--- a/object-file.c
+++ b/object-file.c
@@ -230,11 +230,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_unknown_update,
 		.final_fn = git_hash_unknown_final,
 		.final_oid_fn = git_hash_unknown_final_oid,
-		.unsafe_init_fn = git_hash_unknown_init,
-		.unsafe_clone_fn = git_hash_unknown_clone,
-		.unsafe_update_fn = git_hash_unknown_update,
-		.unsafe_final_fn = git_hash_unknown_final,
-		.unsafe_final_oid_fn = git_hash_unknown_final_oid,
 		.empty_tree = NULL,
 		.empty_blob = NULL,
 		.null_oid = NULL,
@@ -250,11 +245,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_sha1_update,
 		.final_fn = git_hash_sha1_final,
 		.final_oid_fn = git_hash_sha1_final_oid,
-		.unsafe_init_fn = git_hash_sha1_init_unsafe,
-		.unsafe_clone_fn = git_hash_sha1_clone_unsafe,
-		.unsafe_update_fn = git_hash_sha1_update_unsafe,
-		.unsafe_final_fn = git_hash_sha1_final_unsafe,
-		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
 		.unsafe = &sha1_unsafe_algo,
 		.empty_tree = &empty_tree_oid,
 		.empty_blob = &empty_blob_oid,
@@ -271,11 +261,6 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.update_fn = git_hash_sha256_update,
 		.final_fn = git_hash_sha256_final,
 		.final_oid_fn = git_hash_sha256_final_oid,
-		.unsafe_init_fn = git_hash_sha256_init,
-		.unsafe_clone_fn = git_hash_sha256_clone,
-		.unsafe_update_fn = git_hash_sha256_update,
-		.unsafe_final_fn = git_hash_sha256_final,
-		.unsafe_final_oid_fn = git_hash_sha256_final_oid,
 		.empty_tree = &empty_tree_oid_sha256,
 		.empty_blob = &empty_blob_oid_sha256,
 		.null_oid = &null_oid_sha256,
-- 
2.48.0.rc2.35.gd215225db14

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
                     ` (7 preceding siblings ...)
  2025-01-23 17:34   ` [PATCH v4 8/8] hash.h: drop unsafe_ function variants Taylor Blau
@ 2025-01-23 18:30   ` Junio C Hamano
  2025-01-23 18:50     ` Jeff King
  8 siblings, 1 reply; 60+ messages in thread
From: Junio C Hamano @ 2025-01-23 18:30 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

> (This series is based on 14650065b7 (RelNotes/2.48.0: fix typos etc.,
> 2025-01-07)).
>
> Here is a hopefully final version of my series to harden the unsafe hash
> algorithm changes added in v2.47.0. The only difference from last time
> is that hash_algo_by_ptr() now returns GIT_HASH_UNKNOWN for NULL and
> unsafe variants, which is a strict improvement from both v3 of this
> series and the status-quo on master.
>
> As usual, a (small) range-diff is included below for convenience, and
> the original cover letter is as follows:

Thanks.  I'll mark it for 'next', unless there are any further
comments.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants
  2025-01-23 18:30   ` [PATCH v4 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Junio C Hamano
@ 2025-01-23 18:50     ` Jeff King
  0 siblings, 0 replies; 60+ messages in thread
From: Jeff King @ 2025-01-23 18:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, Elijah Newren, Patrick Steinhardt

On Thu, Jan 23, 2025 at 10:30:06AM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > (This series is based on 14650065b7 (RelNotes/2.48.0: fix typos etc.,
> > 2025-01-07)).
> >
> > Here is a hopefully final version of my series to harden the unsafe hash
> > algorithm changes added in v2.47.0. The only difference from last time
> > is that hash_algo_by_ptr() now returns GIT_HASH_UNKNOWN for NULL and
> > unsafe variants, which is a strict improvement from both v3 of this
> > series and the status-quo on master.
> >
> > As usual, a (small) range-diff is included below for convenience, and
> > the original cover letter is as follows:
> 
> Thanks.  I'll mark it for 'next', unless there are any further
> comments.

Nope, it looks good to me.

-Peff

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2025-01-23 18:50 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
2024-11-20 19:13 ` [PATCH 1/6] csum-file: store the hash algorithm as a struct field Taylor Blau
2024-11-21  9:18   ` Jeff King
2024-11-20 19:13 ` [PATCH 2/6] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
2024-11-20 19:13 ` [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
2024-11-21  9:37   ` Jeff King
2024-11-22  0:39     ` brian m. carlson
2024-11-22  8:25       ` Jeff King
2024-11-22 20:37         ` brian m. carlson
2025-01-10 21:38     ` Taylor Blau
2025-01-11  2:45       ` Jeff King
2024-11-20 19:13 ` [PATCH 4/6] csum-file.c: use unsafe_hash_algo() Taylor Blau
2024-11-20 19:13 ` [PATCH 5/6] t/helper/test-hash.c: " Taylor Blau
2024-11-20 19:13 ` [PATCH 6/6] hash.h: drop unsafe_ function variants Taylor Blau
2024-11-21  9:41   ` Jeff King
2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
2025-01-08 19:14   ` [PATCH v2 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
2025-01-08 19:14   ` [PATCH v2 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
2025-01-16 11:48     ` Patrick Steinhardt
2025-01-17 21:17       ` Taylor Blau
2025-01-08 19:14   ` [PATCH v2 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
2025-01-08 19:14   ` [PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
2025-01-16 11:49     ` Patrick Steinhardt
2025-01-17 21:18       ` Taylor Blau
2025-01-08 19:14   ` [PATCH v2 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
2025-01-08 19:14   ` [PATCH v2 6/8] t/helper/test-hash.c: " Taylor Blau
2025-01-08 19:14   ` [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
2025-01-10 10:37     ` Jeff King
2025-01-10 21:50       ` Taylor Blau
2025-01-17 21:30         ` Taylor Blau
2025-01-18 12:15           ` Jeff King
2025-01-08 19:14   ` [PATCH v2 8/8] hash.h: drop unsafe_ function variants Taylor Blau
2025-01-10 10:41   ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Jeff King
2025-01-10 21:29     ` Taylor Blau
2025-01-11  2:42       ` Jeff King
2025-01-11  0:14   ` Junio C Hamano
2025-01-11 17:14     ` Taylor Blau
2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
2025-01-17 22:03   ` [PATCH v3 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
2025-01-17 22:03   ` [PATCH v3 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
2025-01-17 22:03   ` [PATCH v3 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
2025-01-17 22:03   ` [PATCH v3 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
2025-01-17 22:03   ` [PATCH v3 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
2025-01-17 22:03   ` [PATCH v3 6/8] t/helper/test-hash.c: " Taylor Blau
2025-01-17 22:03   ` [PATCH v3 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
2025-01-17 22:03   ` [PATCH v3 8/8] hash.h: drop unsafe_ function variants Taylor Blau
2025-01-18 12:28   ` [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Jeff King
2025-01-18 12:43     ` Jeff King
2025-01-22 21:31       ` Junio C Hamano
2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
2025-01-23 17:34   ` [PATCH v4 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
2025-01-23 17:34   ` [PATCH v4 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
2025-01-23 17:34   ` [PATCH v4 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
2025-01-23 17:34   ` [PATCH v4 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
2025-01-23 17:34   ` [PATCH v4 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
2025-01-23 17:34   ` [PATCH v4 6/8] t/helper/test-hash.c: " Taylor Blau
2025-01-23 17:34   ` [PATCH v4 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
2025-01-23 17:34   ` [PATCH v4 8/8] hash.h: drop unsafe_ function variants Taylor Blau
2025-01-23 18:30   ` [PATCH v4 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Junio C Hamano
2025-01-23 18:50     ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).