* [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
From: brian m. carlson @ 2025-10-27  0:43 UTC
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

This is the second part of the SHA-1/SHA-256 interoperability work. It
introduces our first major use of Rust code, implementing a loose object
format, as well as the preparatory work to make that happen, including
changing types to more Rust-friendly ones. Since Rust will be required
for the interoperability work, the testsuite now requires it as well.

We also verify that an object ID's algorithm is valid when looking up
data in the hash map. The Rust code intentionally has no knowledge of
global mutable state like the_repository, so it cannot fall back to the
main hash algorithm when we've zero-initialized a struct object_id.
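
A minimal sketch of that check (the function name and error handling
are illustrative, not this series' actual API; the only assumption is
that GIT_HASH_UNKNOWN is 0, which is what zero-initialization yields):

    /// Reject GIT_HASH_UNKNOWN (0), the value a zero-initialized
    /// struct object_id carries, instead of assuming any default.
    fn validated_algo(algo: u32) -> Result<u32, &'static str> {
        match algo {
            0 => Err("object ID carries no valid hash algorithm"),
            algo => Ok(algo),
        }
    }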

The advantage of this Rust code is that it is comprehensively covered
by unit tests. We can serialize our loose object map, verify that we
can load it again, and perform various checks, such as whether certain
object IDs are found in the map and mapped correctly. We can also
effectively test our slightly subtle custom binary search code and be
confident that it works; Rust's standard library doesn't provide a way
to binary search records of variable width within a byte slice.
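
A minimal sketch of such a search, assuming sorted records of a width
known only at run time packed into a byte buffer (illustrative only,
not the implementation in src/loose.rs):

    use std::cmp::Ordering;

    /// Binary search over `buf`, which holds sorted records of `width`
    /// bytes each. slice::binary_search only works on &[T] with a
    /// compile-time element type, so runtime-sized records need a
    /// hand-rolled search like this.
    fn find_record(buf: &[u8], width: usize, needle: &[u8]) -> Option<usize> {
        assert!(width > 0 && buf.len() % width == 0 && needle.len() == width);
        let (mut lo, mut hi) = (0, buf.len() / width);
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let rec = &buf[mid * width..(mid + 1) * width];
            match rec.cmp(needle) {
                Ordering::Less => lo = mid + 1,
                Ordering::Greater => hi = mid,
                Ordering::Equal => return Some(mid),
            }
        }
        None
    }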

The new Rust files adopt an approach that is slightly different from
some of our other files and place a license notice at the top. This is
required because of DCO part (a): "I have the right to submit it under
the open source license indicated in the file". It also avoids
ambiguity if the file is copied into a separate location (such as an
LLM training corpus).

brian m. carlson (14):
repository: require Rust support for interoperability
conversion: don't crash when no destination algo
hash: use uint32_t for object_id algorithm
rust: add a ObjectID struct
rust: add a hash algorithm abstraction
hash: add a function to look up hash algo structs
csum-file: define hashwrite's count as a uint32_t
write-or-die: add an fsync component for the loose object map
hash: expose hash context functions to Rust
rust: add a build.rs script for tests
rust: add functionality to hash an object
rust: add a new binary loose object map format
rust: add a small wrapper around the hashfile code
object-file-convert: always make sure object ID algo is valid
Documentation/gitformat-loose.adoc | 104 ++++
Makefile | 5 +-
build.rs | 21 +
csum-file.c | 2 +-
csum-file.h | 2 +-
hash.c | 46 +-
hash.h | 38 +-
object-file-convert.c | 14 +-
oidtree.c | 2 +-
repository.c | 13 +-
repository.h | 4 +-
serve.c | 2 +-
src/csum_file.rs | 81 +++
src/hash.rs | 335 +++++++++++
src/lib.rs | 3 +
src/loose.rs | 912 +++++++++++++++++++++++++++++
src/meson.build | 3 +
t/t1006-cat-file.sh | 82 ++-
t/t1016-compatObjectFormat.sh | 6 +
t/t1500-rev-parse.sh | 2 +-
t/t9305-fast-import-signatures.sh | 4 +-
t/t9350-fast-export.sh | 4 +-
t/test-lib.sh | 4 +
write-or-die.h | 4 +-
24 files changed, 1619 insertions(+), 74 deletions(-)
create mode 100644 build.rs
create mode 100644 src/csum_file.rs
create mode 100644 src/hash.rs
create mode 100644 src/loose.rs

* [PATCH 01/14] repository: require Rust support for interoperability

From: brian m. carlson @ 2025-10-27  0:43 UTC
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

We'll be implementing some of our interoperability code, like the loose
object map, in Rust. While the code currently compiles with the old
loose object map format, which is written entirely in C, we'll soon
replace that with the Rust-based implementation.

Require the use of Rust for compatibility mode and die if it is not
supported. Because the repo argument is not used when Rust is missing,
cast it to void to silence the compiler warning, which we do not care
about.

Add a prerequisite in our tests, RUST, that checks if Rust functionality
is available and use it in the tests that handle interoperability.

This is technically a regression in functionality compared to our
existing state, but pack index v3 is not yet implemented and thus the
functionality is mostly quite broken, which is why we've recently marked
this functionality as experimental. We don't believe anyone is getting
real use out of the interoperability code in its current state, so no
actual users should be negatively impacted by this change.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 repository.c                      |  7 +++
 t/t1006-cat-file.sh               | 82 +++++++++++++++++++++----------
 t/t1016-compatObjectFormat.sh     |  6 +++
 t/t1500-rev-parse.sh              |  2 +-
 t/t9305-fast-import-signatures.sh |  4 +-
 t/t9350-fast-export.sh            |  4 +-
 t/test-lib.sh                     |  4 ++
 7 files changed, 77 insertions(+), 32 deletions(-)

diff --git a/repository.c b/repository.c
index 6faf5c7398..823f110019 100644
--- a/repository.c
+++ b/repository.c
@@ -3,6 +3,7 @@
 #include "repository.h"
 #include "odb.h"
 #include "config.h"
+#include "gettext.h"
 #include "object.h"
 #include "lockfile.h"
 #include "path.h"
@@ -192,11 +193,17 @@ void repo_set_hash_algo(struct repository *repo, int hash_algo)
 
 void repo_set_compat_hash_algo(struct repository *repo, int algo)
 {
+#ifdef WITH_RUST
 	if (hash_algo_by_ptr(repo->hash_algo) == algo)
 		BUG("hash_algo and compat_hash_algo match");
 	repo->compat_hash_algo = algo ? &hash_algos[algo] : NULL;
 	if (repo->compat_hash_algo)
 		repo_read_loose_object_map(repo);
+#else
+	(void)repo;
+	if (algo)
+		die(_("compatibility hash algorithm support requires Rust"));
+#endif
 }
 
 void repo_set_ref_storage_format(struct repository *repo,
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 1f61b666a7..29a9503523 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -241,10 +241,16 @@ hello_content="Hello World"
 hello_size=$(strlen "$hello_content")
 hello_oid=$(echo_without_newline "$hello_content" | git hash-object --stdin)
 
-test_expect_success "setup" '
+test_expect_success "setup part 1" '
 	git config core.repositoryformatversion 1 &&
-	git config extensions.objectformat $test_hash_algo &&
-	git config extensions.compatobjectformat $test_compat_hash_algo &&
+	git config extensions.objectformat $test_hash_algo
+'
+
+test_expect_success RUST 'compat setup' '
+	git config extensions.compatobjectformat $test_compat_hash_algo
+'
+
+test_expect_success 'setup part 2' '
 	echo_without_newline "$hello_content" > hello &&
 	git update-index --add hello &&
 	echo_without_newline "$hello_content" > "path with spaces" &&
@@ -273,9 +279,13 @@ run_blob_tests () {
 	'
 }
 
-hello_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $hello_oid)
 run_blob_tests $hello_oid
-run_blob_tests $hello_compat_oid
+
+if test_have_prereq RUST
+then
+	hello_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $hello_oid)
+	run_blob_tests $hello_compat_oid
+fi
 
 test_expect_success '--batch-check without %(rest) considers whole line' '
 	echo "$hello_oid blob $hello_size" >expect &&
@@ -286,62 +296,76 @@ test_expect_success '--batch-check without %(rest) considers whole line' '
 '
 
 tree_oid=$(git write-tree)
-tree_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tree_oid)
 tree_size=$((2 * $(test_oid rawsz) + 13 + 24))
-tree_compat_size=$((2 * $(test_oid --hash=compat rawsz) + 13 + 24))
 tree_pretty_content="100644 blob $hello_oid	hello${LF}100755 blob $hello_oid	path with spaces${LF}"
-tree_compat_pretty_content="100644 blob $hello_compat_oid	hello${LF}100755 blob $hello_compat_oid	path with spaces${LF}"
 
 run_tests 'tree' $tree_oid "" $tree_size "" "$tree_pretty_content"
-run_tests 'tree' $tree_compat_oid "" $tree_compat_size "" "$tree_compat_pretty_content"
 run_tests 'blob' "$tree_oid:hello" "100644" $hello_size "" "$hello_content" $hello_oid
-run_tests 'blob' "$tree_compat_oid:hello" "100644" $hello_size "" "$hello_content" $hello_compat_oid
 run_tests 'blob' "$tree_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_oid
-run_tests 'blob' "$tree_compat_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_compat_oid
+
+if test_have_prereq RUST
+then
+	tree_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tree_oid)
+	tree_compat_size=$((2 * $(test_oid --hash=compat rawsz) + 13 + 24))
+	tree_compat_pretty_content="100644 blob $hello_compat_oid	hello${LF}100755 blob $hello_compat_oid	path with spaces${LF}"
+
+	run_tests 'tree' $tree_compat_oid "" $tree_compat_size "" "$tree_compat_pretty_content"
+	run_tests 'blob' "$tree_compat_oid:hello" "100644" $hello_size "" "$hello_content" $hello_compat_oid
+	run_tests 'blob' "$tree_compat_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_compat_oid
+fi
 
 commit_message="Initial commit"
 commit_oid=$(echo_without_newline "$commit_message" | git commit-tree $tree_oid)
-commit_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $commit_oid)
 commit_size=$(($(test_oid hexsz) + 137))
-commit_compat_size=$(($(test_oid --hash=compat hexsz) + 137))
 commit_content="tree $tree_oid
 author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> $GIT_AUTHOR_DATE
 committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
 
 $commit_message"
 
-commit_compat_content="tree $tree_compat_oid
+run_tests 'commit' $commit_oid "" $commit_size "$commit_content" "$commit_content"
+
+if test_have_prereq RUST
+then
+	commit_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $commit_oid)
+	commit_compat_size=$(($(test_oid --hash=compat hexsz) + 137))
+	commit_compat_content="tree $tree_compat_oid
 author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> $GIT_AUTHOR_DATE
 committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
 
 $commit_message"
 
-run_tests 'commit' $commit_oid "" $commit_size "$commit_content" "$commit_content"
-run_tests 'commit' $commit_compat_oid "" $commit_compat_size "$commit_compat_content" "$commit_compat_content"
+	run_tests 'commit' $commit_compat_oid "" $commit_compat_size "$commit_compat_content" "$commit_compat_content"
+fi
 
 tag_header_without_oid="type blob
 tag hellotag
 tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL>"
 tag_header_without_timestamp="object $hello_oid
 $tag_header_without_oid"
-tag_compat_header_without_timestamp="object $hello_compat_oid
-$tag_header_without_oid"
 tag_description="This is a tag"
 tag_content="$tag_header_without_timestamp 0 +0000
 
-$tag_description"
-tag_compat_content="$tag_compat_header_without_timestamp 0 +0000
-
 $tag_description"
 
 tag_oid=$(echo_without_newline "$tag_content" | git hash-object -t tag --stdin -w)
 tag_size=$(strlen "$tag_content")
-tag_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tag_oid)
-tag_compat_size=$(strlen "$tag_compat_content")
-
 run_tests 'tag' $tag_oid "" $tag_size "$tag_content" "$tag_content"
-run_tests 'tag' $tag_compat_oid "" $tag_compat_size "$tag_compat_content" "$tag_compat_content"
+
+if test_have_prereq RUST
+then
+	tag_compat_header_without_timestamp="object $hello_compat_oid
+$tag_header_without_oid"
+	tag_compat_content="$tag_compat_header_without_timestamp 0 +0000
+
+$tag_description"
+
+	tag_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tag_oid)
+	tag_compat_size=$(strlen "$tag_compat_content")
+
+	run_tests 'tag' $tag_compat_oid "" $tag_compat_size "$tag_compat_content" "$tag_compat_content"
+fi
 
 test_expect_success "Reach a blob from a tag pointing to it" '
 	echo_without_newline "$hello_content" >expect &&
@@ -590,7 +614,8 @@ flush"
 }
 
 batch_tests $hello_oid $tree_oid $tree_size $commit_oid $commit_size "$commit_content" $tag_oid $tag_size "$tag_content"
-batch_tests $hello_compat_oid $tree_compat_oid $tree_compat_size $commit_compat_oid $commit_compat_size "$commit_compat_content" $tag_compat_oid $tag_compat_size "$tag_compat_content"
+
+test_have_prereq RUST && batch_tests $hello_compat_oid $tree_compat_oid $tree_compat_size $commit_compat_oid $commit_compat_size "$commit_compat_content" $tag_compat_oid $tag_compat_size "$tag_compat_content"
 
 test_expect_success FUNNYNAMES 'setup with newline in input' '
@@ -1226,7 +1251,10 @@ test_expect_success 'batch-check with a submodule' '
 	test_unconfig extensions.compatobjectformat &&
 	printf "160000 commit $(test_oid deadbeef)\tsub\n" >tree-with-sub &&
 	tree=$(git mktree <tree-with-sub) &&
-	test_config extensions.compatobjectformat $test_compat_hash_algo &&
+	if test_have_prereq RUST
+	then
+		test_config extensions.compatobjectformat $test_compat_hash_algo
+	fi &&
 	git cat-file --batch-check >actual <<-EOF &&
 	$tree:sub
diff --git a/t/t1016-compatObjectFormat.sh b/t/t1016-compatObjectFormat.sh
index a9af8b2396..af3ceac3f5 100755
--- a/t/t1016-compatObjectFormat.sh
+++ b/t/t1016-compatObjectFormat.sh
@@ -8,6 +8,12 @@ test_description='Test how well compatObjectFormat works'
 . ./test-lib.sh
 . "$TEST_DIRECTORY"/lib-gpg.sh
 
+if ! test_have_prereq RUST
+then
+	skip_all='interoperability requires a Git built with Rust'
+	test_done
+fi
+
 # All of the follow variables must be defined in the environment:
 # GIT_AUTHOR_NAME
 # GIT_AUTHOR_EMAIL
diff --git a/t/t1500-rev-parse.sh b/t/t1500-rev-parse.sh
index 7739ab611b..98c5a772bd 100755
--- a/t/t1500-rev-parse.sh
+++ b/t/t1500-rev-parse.sh
@@ -208,7 +208,7 @@ test_expect_success 'rev-parse --show-object-format in repo' '
 '
 
-test_expect_success 'rev-parse --show-object-format in repo with compat mode' '
+test_expect_success RUST 'rev-parse --show-object-format in repo with compat mode' '
 	mkdir repo &&
 	(
 		sane_unset GIT_DEFAULT_HASH &&
diff --git a/t/t9305-fast-import-signatures.sh b/t/t9305-fast-import-signatures.sh
index c2b4271658..63c0a2b5c4 100755
--- a/t/t9305-fast-import-signatures.sh
+++ b/t/t9305-fast-import-signatures.sh
@@ -70,7 +70,7 @@ test_expect_success GPGSSH 'strip SSH signature with --signed-commits=strip' '
 	test_must_be_empty log
 '
 
-test_expect_success GPG 'setup a commit with dual OpenPGP signatures on its SHA-1 and SHA-256 formats' '
+test_expect_success RUST,GPG 'setup a commit with dual OpenPGP signatures on its SHA-1 and SHA-256 formats' '
 	# Create a signed SHA-256 commit
 	git init --object-format=sha256 explicit-sha256 &&
 	git -C explicit-sha256 config extensions.compatObjectFormat sha1 &&
@@ -91,7 +91,7 @@ test_expect_success GPG 'setup a commit with dual OpenPGP signatures on its SHA-
 	test_grep -E "^gpgsig-sha256 " out
 '
 
-test_expect_success GPG 'strip both OpenPGP signatures with --signed-commits=warn-strip' '
+test_expect_success RUST,GPG 'strip both OpenPGP signatures with --signed-commits=warn-strip' '
 	git -C explicit-sha256 fast-export --signed-commits=verbatim dual-signed >output &&
 	test_grep -E "^gpgsig sha1 openpgp" output &&
 	test_grep -E "^gpgsig sha256 openpgp" output &&
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 8f85c69d62..bf55e1e2e6 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -932,7 +932,7 @@ test_expect_success 'fast-export handles --end-of-options' '
 	test_cmp expect actual
 '
 
-test_expect_success GPG 'setup a commit with dual signatures on its SHA-1 and SHA-256 formats' '
+test_expect_success GPG,RUST 'setup a commit with dual signatures on its SHA-1 and SHA-256 formats' '
 	# Create a signed SHA-256 commit
 	git init --object-format=sha256 explicit-sha256 &&
 	git -C explicit-sha256 config extensions.compatObjectFormat sha1 &&
@@ -953,7 +953,7 @@ test_expect_success GPG 'setup a commit with dual signatures on its SHA-1 and SH
 	test_grep -E "^gpgsig-sha256 " out
 '
 
-test_expect_success GPG 'export and import of doubly signed commit' '
+test_expect_success GPG,RUST 'export and import of doubly signed commit' '
 	git -C explicit-sha256 fast-export --signed-commits=verbatim dual-signed >output &&
 	test_grep -E "^gpgsig sha1 openpgp" output &&
 	test_grep -E "^gpgsig sha256 openpgp" output &&
diff --git a/t/test-lib.sh b/t/test-lib.sh
index ef0ab7ec2d..3499a83806 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1890,6 +1890,10 @@ test_lazy_prereq LONG_IS_64BIT '
 	test 8 -le "$(build_option sizeof-long)"
 '
 
+test_lazy_prereq RUST '
+	test "$(build_option rust)" = enabled
+'
+
 test_lazy_prereq TIME_IS_64BIT 'test-tool date is64bit'
 test_lazy_prereq TIME_T_IS_64BIT 'test-tool date time_t-is64bit'

* Re: [PATCH 01/14] repository: require Rust support for interoperability

From: Patrick Steinhardt @ 2025-10-28  9:16 UTC
To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren

On Mon, Oct 27, 2025 at 12:43:51AM +0000, brian m. carlson wrote:
> We'll be implementing some of our interoperability code, like the loose
> object map, in Rust. While the code currently compiles with the old
> loose object map format, which is written entirely in C, we'll soon
> replace that with the Rust-based implementation.
>
> Require the use of Rust for compatibility mode and die if it is not
> supported. Because the repo argument is not used when Rust is missing,
> cast it to void to silence the compiler warning, which we do not care
> about.
>
> Add a prerequisite in our tests, RUST, that checks if Rust functionality
> is available and use it in the tests that handle interoperability.
>
> This is technically a regression in functionality compared to our
> existing state, but pack index v3 is not yet implemented and thus the
> functionality is mostly quite broken, which is why we've recently marked
> this functionality as experimental. We don't believe anyone is getting
> real use out of the interoperability code in its current state, so no
> actual users should be negatively impacted by this change.

Yeah, I don't see much of an issue with this.

> diff --git a/repository.c b/repository.c
> index 6faf5c7398..823f110019 100644
> --- a/repository.c
> +++ b/repository.c
> @@ -192,11 +193,17 @@ void repo_set_hash_algo(struct repository *repo, int hash_algo)
>
>  void repo_set_compat_hash_algo(struct repository *repo, int algo)
>  {
> +#ifdef WITH_RUST
>  	if (hash_algo_by_ptr(repo->hash_algo) == algo)
>  		BUG("hash_algo and compat_hash_algo match");
>  	repo->compat_hash_algo = algo ? &hash_algos[algo] : NULL;
>  	if (repo->compat_hash_algo)
>  		repo_read_loose_object_map(repo);
> +#else
> +	(void)repo;

You can annotate `repo` with `MAYBE_UNUSED` instead of casting.

Patrick

* [PATCH 02/14] conversion: don't crash when no destination algo

From: brian m. carlson @ 2025-10-27  0:43 UTC
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

When we set up a repository that doesn't have a compatibility hash
algorithm, we set the destination algorithm object to NULL. In such a
case, we want to silently do nothing instead of crashing, so simply
treat the operation as a no-op and copy the object ID.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 object-file-convert.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/object-file-convert.c b/object-file-convert.c
index 7ab875afe6..e44c821084 100644
--- a/object-file-convert.c
+++ b/object-file-convert.c
@@ -23,7 +23,7 @@ int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
 	const struct git_hash_algo *from =
 		src->algo ? &hash_algos[src->algo] : repo->hash_algo;
 
-	if (from == to) {
+	if (from == to || !to) {
 		if (src != dest)
 			oidcpy(dest, src);
 		return 0;

* [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: brian m. carlson @ 2025-10-27  0:43 UTC
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

We currently use an int for this value, but we'll define this structure
from Rust in a future commit and we want to ensure that our data types
are exactly identical. To make that possible, use a uint32_t for the
hash algorithm.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 hash.c       |  6 +++---
 hash.h       | 10 +++++-----
 oidtree.c    |  2 +-
 repository.c |  6 +++---
 repository.h |  4 ++--
 serve.c      |  2 +-
 6 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/hash.c b/hash.c
index 4a04ecb50e..81b4f87027 100644
--- a/hash.c
+++ b/hash.c
@@ -241,7 +241,7 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop)
 	return oid_to_hex_r(buf, algop->empty_tree);
 }
 
-int hash_algo_by_name(const char *name)
+uint32_t hash_algo_by_name(const char *name)
 {
 	if (!name)
 		return GIT_HASH_UNKNOWN;
@@ -251,7 +251,7 @@ int hash_algo_by_name(const char *name)
 	return GIT_HASH_UNKNOWN;
 }
 
-int hash_algo_by_id(uint32_t format_id)
+uint32_t hash_algo_by_id(uint32_t format_id)
 {
 	for (size_t i = 1; i < GIT_HASH_NALGOS; i++)
 		if (format_id == hash_algos[i].format_id)
@@ -259,7 +259,7 @@ int hash_algo_by_id(uint32_t format_id)
 	return GIT_HASH_UNKNOWN;
 }
 
-int hash_algo_by_length(size_t len)
+uint32_t hash_algo_by_length(size_t len)
 {
 	for (size_t i = 1; i < GIT_HASH_NALGOS; i++)
 		if (len == hash_algos[i].rawsz)
diff --git a/hash.h b/hash.h
index fae966b23c..99c9c2a0a8 100644
--- a/hash.h
+++ b/hash.h
@@ -211,7 +211,7 @@ static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *s
 
 struct object_id {
 	unsigned char hash[GIT_MAX_RAWSZ];
-	int algo;	/* XXX requires 4-byte alignment */
+	uint32_t algo;	/* XXX requires 4-byte alignment */
 };
 
 #define GET_OID_QUIETLY 01
@@ -344,13 +344,13 @@ static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx
  * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if
  * the name doesn't match a known algorithm.
  */
-int hash_algo_by_name(const char *name);
+uint32_t hash_algo_by_name(const char *name);
 /* Identical, except based on the format ID. */
-int hash_algo_by_id(uint32_t format_id);
+uint32_t hash_algo_by_id(uint32_t format_id);
 /* Identical, except based on the length. */
-int hash_algo_by_length(size_t len);
+uint32_t hash_algo_by_length(size_t len);
 /* Identical, except for a pointer to struct git_hash_algo. */
-static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
+static inline uint32_t hash_algo_by_ptr(const struct git_hash_algo *p)
 {
 	size_t i;
 	for (i = 0; i < GIT_HASH_NALGOS; i++) {
diff --git a/oidtree.c b/oidtree.c
index 151568f74f..324de94934 100644
--- a/oidtree.c
+++ b/oidtree.c
@@ -10,7 +10,7 @@ struct oidtree_iter_data {
 	oidtree_iter fn;
 	void *arg;
 	size_t *last_nibble_at;
-	int algo;
+	uint32_t algo;
 	uint8_t last_byte;
 };
 
diff --git a/repository.c b/repository.c
index 823f110019..34a029b1e4 100644
--- a/repository.c
+++ b/repository.c
@@ -39,7 +39,7 @@ struct repository *the_repository = &the_repo;
 static void set_default_hash_algo(struct repository *repo)
 {
 	const char *hash_name;
-	int algo;
+	uint32_t algo;
 
 	hash_name = getenv("GIT_TEST_DEFAULT_HASH_ALGO");
 	if (!hash_name)
@@ -186,12 +186,12 @@ void repo_set_gitdir(struct repository *repo,
 			   repo->gitdir, "index");
 }
 
-void repo_set_hash_algo(struct repository *repo, int hash_algo)
+void repo_set_hash_algo(struct repository *repo, uint32_t hash_algo)
 {
 	repo->hash_algo = &hash_algos[hash_algo];
 }
 
-void repo_set_compat_hash_algo(struct repository *repo, int algo)
+void repo_set_compat_hash_algo(struct repository *repo, uint32_t algo)
 {
 #ifdef WITH_RUST
 	if (hash_algo_by_ptr(repo->hash_algo) == algo)
diff --git a/repository.h b/repository.h
index 5808a5d610..c0a3543b24 100644
--- a/repository.h
+++ b/repository.h
@@ -193,8 +193,8 @@ struct set_gitdir_args {
 void repo_set_gitdir(struct repository *repo, const char *root,
 		     const struct set_gitdir_args *extra_args);
 void repo_set_worktree(struct repository *repo, const char *path);
-void repo_set_hash_algo(struct repository *repo, int algo);
-void repo_set_compat_hash_algo(struct repository *repo, int compat_algo);
+void repo_set_hash_algo(struct repository *repo, uint32_t algo);
+void repo_set_compat_hash_algo(struct repository *repo, uint32_t compat_algo);
 void repo_set_ref_storage_format(struct repository *repo,
 				 enum ref_storage_format format);
 void initialize_repository(struct repository *repo);
diff --git a/serve.c b/serve.c
index 53ecab3b42..49a6e39b1d 100644
--- a/serve.c
+++ b/serve.c
@@ -14,7 +14,7 @@
 
 static int advertise_sid = -1;
 static int advertise_object_info = -1;
-static int client_hash_algo = GIT_HASH_SHA1_LEGACY;
+static uint32_t client_hash_algo = GIT_HASH_SHA1_LEGACY;
 
 static int always_advertise(struct repository *r UNUSED,
 			    struct strbuf *value UNUSED)

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Patrick Steinhardt @ 2025-10-28  9:16 UTC
To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren

On Mon, Oct 27, 2025 at 12:43:53AM +0000, brian m. carlson wrote:
> We currently use an int for this value, but we'll define this structure
> from Rust in a future commit and we want to ensure that our data types
> are exactly identical. To make that possible, use a uint32_t for the
> hash algorithm.

An alternative would be to introduce an enum and set up bindgen so that
we can pull this enum into Rust. I'd personally favor that over using a
uint32_t as it conveys way more meaning. Have you considered this?

Patrick

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Ezekiel Newren @ 2025-10-28 18:28 UTC
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Junio C Hamano

On Tue, Oct 28, 2025 at 3:17 AM Patrick Steinhardt <ps@pks.im> wrote:
> On Mon, Oct 27, 2025 at 12:43:53AM +0000, brian m. carlson wrote:
> > We currently use an int for this value, but we'll define this structure
> > from Rust in a future commit and we want to ensure that our data types
> > are exactly identical. To make that possible, use a uint32_t for the
> > hash algorithm.
>
> An alternative would be to introduce an enum and set up bindgen so that
> we can pull this enum into Rust. I'd personally favor that over using a
> uint32_t as it conveys way more meaning. Have you considered this?

I think uint32_t is appropriate here over an enum because this value
will also exist on disk. An enum in Rust is really only safe if it
exists exclusively in memory and is untouched by C. Later in this patch
series there is a function that creates an enum from a u32. I agree
with Brian's design choice here.
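
A hedged sketch of the kind of checked u32-to-enum conversion Ezekiel
alludes to (names and constant values are illustrative, not the series'
actual code): the FFI field stays a plain u32, and the conversion to a
Rust enum happens at the boundary, where any value C may have written
can be rejected instead of invoking undefined behaviour.

    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum Algo {
        Sha1,
        Sha256,
    }

    impl Algo {
        /// Convert an untrusted u32 (e.g. read from disk or set by C)
        /// into an enum, refusing anything we don't recognize.
        fn from_raw(raw: u32) -> Option<Algo> {
            match raw {
                1 => Some(Algo::Sha1),   // illustrative constant
                2 => Some(Algo::Sha256), // illustrative constant
                _ => None,               // GIT_HASH_UNKNOWN or garbage
            }
        }
    }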

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Junio C Hamano @ 2025-10-28 19:33 UTC
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Ezekiel Newren

Patrick Steinhardt <ps@pks.im> writes:

> On Mon, Oct 27, 2025 at 12:43:53AM +0000, brian m. carlson wrote:
>> We currently use an int for this value, but we'll define this structure
>> from Rust in a future commit and we want to ensure that our data types
>> are exactly identical. To make that possible, use a uint32_t for the
>> hash algorithm.
>
> An alternative would be to introduce an enum and set up bindgen so that
> we can pull this enum into Rust. I'd personally favor that over using a
> uint32_t as it conveys way more meaning. Have you considered this?

Yeah, I do not very much appreciate change from "int" to "uint32_t"
randomly done only for things that happen to be used by both C and
Rust. "When should I use 'int' or 'unsigned' and when should I use
'uint32_t'?" becomes extremely hard to answer.

I suspect that it would be much more palatable if these functions
and struct members are to use a distinct type that is used only by
hash algorithm number (your "enum" is fine), that is typedef'ed to
be the 32-bit unsigned integer, e.g.,

	+typedef uint32_t hash_algo_type;
	-int hash_algo_by_name(const char *name)
	+hash_algo_type hash_algo_by_name(const char *name)

Yeah, I know that C does not give us type safety against mixing two
different things, both of which are typedef'ed to the same uint32_t,
but doing something like the above would still add documentation
value.

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Ezekiel Newren @ 2025-10-28 19:58 UTC
To: Junio C Hamano; +Cc: Patrick Steinhardt, brian m. carlson, git

On Tue, Oct 28, 2025 at 1:33 PM Junio C Hamano <gitster@pobox.com> wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > On Mon, Oct 27, 2025 at 12:43:53AM +0000, brian m. carlson wrote:
> >> We currently use an int for this value, but we'll define this structure
> >> from Rust in a future commit and we want to ensure that our data types
> >> are exactly identical. To make that possible, use a uint32_t for the
> >> hash algorithm.
> >
> > An alternative would be to introduce an enum and set up bindgen so that
> > we can pull this enum into Rust. I'd personally favor that over using a
> > uint32_t as it conveys way more meaning. Have you considered this?
>
> Yeah, I do not very much appreciate change from "int" to "uint32_t"
> randomly done only for things that happen to be used by both C and
> Rust. "When should I use 'int' or 'unsigned' and when should I use
> 'uint32_t'?" becomes extremely hard to answer.

I think the most appropriate time to change from C's ambiguous types to
unambiguous types is when it's going to be used for Rust FFI. uint32_t
should be used everywhere and casting to int or unsigned should be done
where that code hasn't been converted yet. This commit isn't random,
it's a deliberate effort to address code debt.

> I suspect that it would be much more palatable if these functions
> and struct members are to use a distinct type that is used only by
> hash algorithm number (your "enum" is fine), that is typedef'ed to
> be the 32-bit unsigned integer, e.g.,
>
>	+typedef uint32_t hash_algo_type;
>	-int hash_algo_by_name(const char *name)
>	+hash_algo_type hash_algo_by_name(const char *name)
>
> Yeah, I know that C does not give us type safety against mixing two
> different things, both of which are typedef'ed to the same uint32_t,
> but doing something like the above would still add documentation
> value.

I'm against passing Rust enum types over the FFI boundary since Rust is
free to add extra bytes to distinguish between types (and it's
documented by Rust as not being ABI stable). Even if something like
#[repr(C)] is used, the problem is that the enum on the Rust side will
have an implicit field, where that implicit field will need to be made
explicit on the C side, and if C sets an invalid value for that
implicit field then that will result in Rust UB. Converting Rust enum
types to C is non-trivial and has many gotchas.

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Junio C Hamano @ 2025-10-28 20:20 UTC
To: Ezekiel Newren; +Cc: Patrick Steinhardt, brian m. carlson, git

Ezekiel Newren <ezekielnewren@gmail.com> writes:

>> I suspect that it would be much more palatable if these functions
>> and struct members are to use a distinct type that is used only by
>> hash algorithm number (your "enum" is fine), that is typedef'ed to
>> be the 32-bit unsigned integer, e.g.,
>>
>>	+typedef uint32_t hash_algo_type;
>>	-int hash_algo_by_name(const char *name)
>>	+hash_algo_type hash_algo_by_name(const char *name)
>>
>> Yeah, I know that C does not give us type safety against mixing two
>> different things, both of which are typedef'ed to the same uint32_t,
>> but doing something like the above would still add documentation
>> value.
>
> I'm against passing Rust enum types over the FFI boundary since Rust
> is free to add extra bytes to distinguish between types (and it's
> documented by Rust as not being ABI stable).

It's OK for you to be against it. My mention of "enum" was enum on the
purely C-side and I didn't have Rust's enum in mind at all. As Brian
defined ObjectID on the Rust side, the type tag was done as u32, IIUC,
not Rust's enum.

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: brian m. carlson @ 2025-10-30  0:23 UTC
To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Ezekiel Newren

On 2025-10-28 at 19:33:32, Junio C Hamano wrote:
> Yeah, I do not very much appreciate change from "int" to "uint32_t"
> randomly done only for things that happen to be used by both C and
> Rust. "When should I use 'int' or 'unsigned' and when should I use
> 'uint32_t'?" becomes extremely hard to answer.

In general, the answer is that we should use `int` or `unsigned` when
you're defining a loop index or other non-structure types that are only
used from C. Otherwise, we should use one of the stdint.h or stddef.h
types ((u)int*_t, (s)size_t, etc.), since these have defined,
well-understood sizes. Also, in general, we want to use unsigned types
for things that cannot have valid negative values (such as the hash
algorithm constants that are also array indices), especially since Rust
tends not to use sentinel values (preferring `Option` instead).

Part of our problem is that being lazy and making lots of assumptions
in our codebase has led to some suboptimal consequences. Our diff code
can't handle files bigger than about 1 GiB because we use `int`, and
Windows has all sorts of size limitations because we assumed that
sizeof(long) == sizeof(size_t) == sizeof(void *). Nobody now would say,
"Gee, I think we'd like to have these arbitrary 32-bit size limits,"
and using something with a fixed size helps us think, "How big should
this data type be? Do I really want to limit this data structure to
processing only 32 bits worth of data?"

In this case, the use of a 32-bit value is fine because we already have
that for the existing type (via `int`) and it is extremely unlikely
that 4 billion cryptographic hash algorithms will ever be created, let
alone implemented in Git, so the size is not a factor.

-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Collin Funk @ 2025-10-30  1:58 UTC
To: brian m. carlson; +Cc: Junio C Hamano, Patrick Steinhardt, git, Ezekiel Newren

Hi Brian,

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On 2025-10-28 at 19:33:32, Junio C Hamano wrote:
>> Yeah, I do not very much appreciate change from "int" to "uint32_t"
>> randomly done only for things that happen to be used by both C and
>> Rust. "When should I use 'int' or 'unsigned' and when should I use
>> 'uint32_t'?" becomes extremely hard to answer.
>
> In general, the answer is that we should use `int` or `unsigned` when
> you're defining a loop index or other non-structure types that are only
> used from C. Otherwise, we should use one of the stdint.h or stddef.h
> types ((u)int*_t, (s)size_t, etc.), since these have defined,
> well-understood sizes. Also, in general, we want to use unsigned types
> for things that cannot have valid negative values (such as the hash
> algorithm constants that are also array indices), especially since Rust
> tends not to use sentinel values (preferring `Option` instead).

I don't necessarily disagree with your point, just want to reiterate a
point I touched on in another thread [1]. In some cases it is valuable
to use signed integers even if a valid value will never be negative.
This is because signed integer overflow can be easily caught with
-fsanitize=undefined. An unsigned integer wrapping around is perfectly
defined, but may lead to strange bugs in your program.

> Part of our problem is that being lazy and making lots of assumptions in
> our codebase has led to some suboptimal consequences. Our diff code
> can't handle files bigger than about 1 GiB because we use `int` and
> Windows has all sorts of size limitations because we assumed that
> sizeof(long) == sizeof(size_t) == sizeof(void *). Nobody now would say,
> "Gee, I think we'd like to have these arbitrary 32-bit size limits," and
> using something with a fixed size helps us think, "How big should this
> data type be? Do I really want to limit this data structure to
> processing only 32 bits worth of data?"
>
> In this case, the use of a 32-bit value is fine because we already have
> that for the existing type (via `int`) and it is extremely unlikely that
> 4 billion cryptographic hash algorithms will ever be created, let alone
> implemented in Git, so the size is not a factor.

I guess intmax_t and uintmax_t are probably not usable with Rust, since
they are not fixed width?

Collin

[1] https://public-inbox.org/git/87jz16dux5.fsf@gmail.com/

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: brian m. carlson @ 2025-11-03  1:30 UTC
To: Collin Funk; +Cc: Junio C Hamano, Patrick Steinhardt, git, Ezekiel Newren

On 2025-10-30 at 01:58:52, Collin Funk wrote:
> I guess intmax_t and uintmax_t are probably not usable with Rust, since
> they are not fixed width?

They are effectively 64 bit everywhere, so `i64` or `u64` is
appropriate. These types are not actually the largest possible integers
anymore, since they were originally defined as 64 bit and implementers
refused to change them once 128-bit values were supported, because that
would break ABI. With gcc or clang, you can do this to see:

    % clang -E -dM - </dev/null | grep INTMAX_TYPE
    #define __INTMAX_TYPE__ long int
    #define __UINTMAX_TYPE__ long unsigned int

Rust also has `i128` and `u128`, which are part of the ABI and are also
used for things like `std::time::Duration::as_nanos`. Rust claims that
it is ABI-compatible with C's `__int128` where that exists, but it does
not exist in all C compilers and on all architectures. Compatibility
with C's `_BitInt(128)` is explicitly disclaimed.

-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: brian m. carlson @ 2025-10-29  0:33 UTC
To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Ezekiel Newren

On 2025-10-28 at 09:16:57, Patrick Steinhardt wrote:
> An alternative would be to introduce an enum and set up bindgen so that
> we can pull this enum into Rust. I'd personally favor that over using a
> uint32_t as it conveys way more meaning. Have you considered this?

That would lead to problems because we zero-initialize some object IDs
(and you see later in the series what problems that causes) and that
will absolutely not work in Rust, since setting an enum to an invalid
value is undefined behaviour.

-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Patrick Steinhardt @ 2025-10-29  9:07 UTC
To: brian m. carlson, git, Junio C Hamano, Ezekiel Newren

On Wed, Oct 29, 2025 at 12:33:30AM +0000, brian m. carlson wrote:
> On 2025-10-28 at 09:16:57, Patrick Steinhardt wrote:
> > An alternative would be to introduce an enum and set up bindgen so that
> > we can pull this enum into Rust. I'd personally favor that over using a
> > uint32_t as it conveys way more meaning. Have you considered this?
>
> That would lead to problems because we zero-initialize some object IDs
> (and you see later in the series what problems that causes) and that
> will absolutely not work in Rust, since setting an enum to an invalid
> value is undefined behaviour.

We could of course try and represent the uninitialized state with a
third enum state. But it would probably make things awfully unergonomic
all over the place :/

Patrick
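
A hedged sketch of the "third state" Patrick mentions (illustrative
only, not code from the series; the thread's concern is that every use
site would then have to handle the extra variant):

    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum Algo {
        Unknown, // zero-initialized object IDs would land here
        Sha1,
        Sha256,
    }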

* [PATCH 04/14] rust: add a ObjectID struct

From: brian m. carlson @ 2025-10-27  0:43 UTC
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

We'd like to be able to write some Rust code that can work with object
IDs. Add a structure here that's identical to struct object_id in C.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile        |  1 +
 src/hash.rs     | 21 +++++++++++++++++++++
 src/lib.rs      |  1 +
 src/meson.build |  1 +
 4 files changed, 24 insertions(+)
 create mode 100644 src/hash.rs

diff --git a/Makefile b/Makefile
index 1919d35bf3..7e5a735ca6 100644
--- a/Makefile
+++ b/Makefile
@@ -1521,6 +1521,7 @@ CLAR_TEST_OBJS += $(UNIT_TEST_DIR)/unit-test.o
 
 UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o
 
+RUST_SOURCES += src/hash.rs
 RUST_SOURCES += src/lib.rs
 RUST_SOURCES += src/varint.rs
 
diff --git a/src/hash.rs b/src/hash.rs
new file mode 100644
index 0000000000..0219391820
--- /dev/null
+++ b/src/hash.rs
@@ -0,0 +1,21 @@
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation: version 2 of the License, dated June 1991.
+//
+// This program is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+// GNU General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License along
+// with this program; if not, see <https://www.gnu.org/licenses/>.
+
+pub const GIT_MAX_RAWSZ: usize = 32;
+
+/// A binary object ID.
+#[repr(C)]
+#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)]
+pub struct ObjectID {
+    pub hash: [u8; GIT_MAX_RAWSZ],
+    pub algo: u32,
+}
diff --git a/src/lib.rs b/src/lib.rs
index 9da70d8b57..cf7c962509 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -1 +1,2 @@
+pub mod hash;
 pub mod varint;
diff --git a/src/meson.build b/src/meson.build
index 25b9ad5a14..c77041a3fa 100644
--- a/src/meson.build
+++ b/src/meson.build
@@ -1,4 +1,5 @@
 libgit_rs_sources = [
+  'hash.rs',
   'lib.rs',
   'varint.rs',
 ]

* Re: [PATCH 04/14] rust: add a ObjectID struct

From: Patrick Steinhardt @ 2025-10-28  9:17 UTC
To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren

On Mon, Oct 27, 2025 at 12:43:54AM +0000, brian m. carlson wrote:
> diff --git a/src/hash.rs b/src/hash.rs
> new file mode 100644
> index 0000000000..0219391820
> --- /dev/null
> +++ b/src/hash.rs
> @@ -0,0 +1,21 @@
> +// This program is free software; you can redistribute it and/or modify
> +// it under the terms of the GNU General Public License as published by
> +// the Free Software Foundation: version 2 of the License, dated June 1991.
> +//
> +// This program is distributed in the hope that it will be useful,
> +// but WITHOUT ANY WARRANTY; without even the implied warranty of
> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +// GNU General Public License for more details.
> +//
> +// You should have received a copy of the GNU General Public License along
> +// with this program; if not, see <https://www.gnu.org/licenses/>.

We typically don't have these headers for our C code, so why have it
over here?

> +pub const GIT_MAX_RAWSZ: usize = 32;
> +
> +/// A binary object ID.
> +#[repr(C)]
> +#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)]
> +pub struct ObjectID {
> +    pub hash: [u8; GIT_MAX_RAWSZ],
> +    pub algo: u32,
> +}

An alternative to represent this type would be to use an enum:

    pub enum ObjectID {
        SHA1([u8; GIT_SHA1_RAWSZ]),
        SHA256([u8; GIT_SHA256_RAWSZ]),
    }

That would give us some type safety going forward, but it might be
harder to work with for us?

Patrick

* Re: [PATCH 04/14] rust: add a ObjectID struct

From: Ezekiel Newren @ 2025-10-28 19:07 UTC
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Junio C Hamano

On Tue, Oct 28, 2025 at 3:17 AM Patrick Steinhardt <ps@pks.im> wrote:
> On Mon, Oct 27, 2025 at 12:43:54AM +0000, brian m. carlson wrote:
> > diff --git a/src/hash.rs b/src/hash.rs
> > new file mode 100644
> > index 0000000000..0219391820
> > --- /dev/null
> > +++ b/src/hash.rs
> > @@ -0,0 +1,21 @@
> > +// This program is free software; you can redistribute it and/or modify
> > +// it under the terms of the GNU General Public License as published by
> > +// the Free Software Foundation: version 2 of the License, dated June 1991.
> > +//
> > +// This program is distributed in the hope that it will be useful,
> > +// but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > +// GNU General Public License for more details.
> > +//
> > +// You should have received a copy of the GNU General Public License along
> > +// with this program; if not, see <https://www.gnu.org/licenses/>.
>
> We typically don't have these headers for our C code, so why have it
> over here?

I'm wondering this too even though you gave a reason in your cover
letter. I'm against putting licenses in each source file, and don't see
how it's better than having a separate license file.

> > +pub const GIT_MAX_RAWSZ: usize = 32;
> > +
> > +/// A binary object ID.
> > +#[repr(C)]
> > +#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)]
> > +pub struct ObjectID {
> > +    pub hash: [u8; GIT_MAX_RAWSZ],
> > +    pub algo: u32,
> > +}
>
> An alternative to represent this type would be to use an enum:
>
>     pub enum ObjectID {
>         SHA1([u8; GIT_SHA1_RAWSZ]),
>         SHA256([u8; GIT_SHA256_RAWSZ]),
>     }
>
> That would give us some type safety going forward, but it might be
> harder to work with for us?

This would be fine if it was used exclusively in Rust, but since this
is a type that has to cross the FFI boundary it should be defined as a
struct in C and Rust. If you run size_of::<ObjectID>() you'll get 33
(but it could be something else). Without #[repr(C, u8)] the Rust
compiler is free to choose how to define the discriminant (its length
and values) to distinguish the 2 types. If you do use #[repr(C, u8)]
then you have the possible problem of C setting an invalid discriminant
value, which would result in undefined behavior. It also doesn't make
sense as an FFI type since a Rust enum is closer to a C union than a C
enum. The point here is that Brian is matching the existing C struct
with an equivalent Rust struct.
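
To make the layout point concrete (a sketch, not part of the series):
with #[repr(C)] the struct's size and field order are fixed and
C-compatible, and the expected size can even be checked at compile
time.

    pub const GIT_MAX_RAWSZ: usize = 32;

    #[repr(C)]
    pub struct ObjectID {
        pub hash: [u8; GIT_MAX_RAWSZ],
        pub algo: u32,
    }

    // 32 hash bytes plus a 4-byte algo field, 4-byte aligned: 36 bytes,
    // matching sizeof(struct object_id) on the C side.
    const _: () = assert!(std::mem::size_of::<ObjectID>() == 36);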

* Re: [PATCH 04/14] rust: add a ObjectID struct

From: brian m. carlson @ 2025-10-29  0:42 UTC
To: Ezekiel Newren; +Cc: Patrick Steinhardt, git, Junio C Hamano

On 2025-10-28 at 19:07:36, Ezekiel Newren wrote:
> I'm wondering this too even though you gave a reason in your cover
> letter. I'm against putting licenses in each source file, and don't
> see how it's better than having a separate license file.

As I said, the DCO says the "open source license indicated in the
file". I also see lots of open source code being sucked into LLMs these
days as training data and I want the LLM to learn that Git's code is
GPLv2, so when it produces output, it does so with the GPLv2 header in
the file. We already have similar notices in the reftable code, so
there's plenty of precedent for it.

> This would be fine if it was used exclusively in Rust, but since this
> is a type that has to cross the FFI boundary it should be defined as a
> struct in C and Rust. If you run size_of::<ObjectID>() you'll get 33
> (but it could be something else). Without #[repr(C, u8)] the Rust
> compiler is free to choose how to define the discriminant (its length
> and values) to distinguish the 2 types. If you do use #[repr(C, u8)]
> then you have the possible problem of C setting an invalid
> discriminant value, which would result in undefined behavior. It also
> doesn't make sense as an FFI type since a Rust enum is closer to a C
> union than a C enum. The point here is that Brian is matching the
> existing C struct with an equivalent Rust struct.

Exactly.

-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

* Re: [PATCH 04/14] rust: add a ObjectID struct

From: Junio C Hamano @ 2025-10-28 19:40 UTC
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Ezekiel Newren

Patrick Steinhardt <ps@pks.im> writes:

> On Mon, Oct 27, 2025 at 12:43:54AM +0000, brian m. carlson wrote:
>> diff --git a/src/hash.rs b/src/hash.rs
>> new file mode 100644
>> index 0000000000..0219391820
>> --- /dev/null
>> +++ b/src/hash.rs
>> @@ -0,0 +1,21 @@
>> +// This program is free software; you can redistribute it and/or modify
>> +// it under the terms of the GNU General Public License as published by
>> +// the Free Software Foundation: version 2 of the License, dated June 1991.
>> +//
>> +// This program is distributed in the hope that it will be useful,
>> +// but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> +// GNU General Public License for more details.
>> +//
>> +// You should have received a copy of the GNU General Public License along
>> +// with this program; if not, see <https://www.gnu.org/licenses/>.
>
> We typically don't have these headers for our C code, so why have it
> over here?

Yeah, another thing that puzzles me is if src/ is a good name for the
directory in the longer run (unless we plan to rewrite everything in
Rust, that is) for housing our source code written in Rust (I am
assuming that *.c files are unwelcome in that directory). But it may be
a separate topic, perhaps?

>> +pub const GIT_MAX_RAWSZ: usize = 32;
>> +
>> +/// A binary object ID.
>> +#[repr(C)]
>> +#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)]
>> +pub struct ObjectID {
>> +    pub hash: [u8; GIT_MAX_RAWSZ],
>> +    pub algo: u32,
>> +}
>
> An alternative to represent this type would be to use an enum:
>
>     pub enum ObjectID {
>         SHA1([u8; GIT_SHA1_RAWSZ]),
>         SHA256([u8; GIT_SHA256_RAWSZ]),
>     }
>
> That would give us some type safety going forward, but it might be
> harder to work with for us?

Can the latter be made to interoperate with the C side well, with the
same memory layout? Perhaps there may be a way, but the way written in
the patch looks more obviously identical to what we have on the C side,
so...

* Re: [PATCH 04/14] rust: add a ObjectID struct

From: brian m. carlson @ 2025-10-29  0:47 UTC
To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Ezekiel Newren

On 2025-10-28 at 19:40:39, Junio C Hamano wrote:
> Yeah, another thing that puzzles me is if src/ is a good name for
> the directory in the longer run (unless we plan to rewrite
> everything in Rust, that is) for housing our source code written in
> Rust (I am assuming that *.c files are unwelcome in that directory).
> But it may be a separate topic, perhaps?

That's a standard location for Rust files. The root of the repository
has `Cargo.toml` and `Cargo.lock`, source files go in `src`, and output
goes in `target`. So there's not much of an option, really.

The hierarchy of the source files also affects import locations. So
`src/hash.rs` is the `crate::hash` module and `src/foo/bar/baz.rs` is
`crate::foo::bar::baz`.

There's no reason that `*.c` files cannot live in `src`, but Cargo pays
no attention to those (unless they're compiled with the `cc` crate as
part of `build.rs`). We had a project at work that moved from C to Rust
incrementally and we moved all the C files into `src`, which was not a
problem.

-- 
brian m. carlson (they/them)
Toronto, Ontario, CA
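
A sketch of the module mapping brian describes, mirroring what
src/lib.rs in this series already contains:

    // src/lib.rs is the crate root; each `pub mod` line maps to a file
    // under src/, so the module path follows the directory hierarchy.
    pub mod hash;   // src/hash.rs   -> crate::hash
    pub mod varint; // src/varint.rs -> crate::varint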
* Re: [PATCH 04/14] rust: add a ObjectID struct 2025-10-28 9:17 ` Patrick Steinhardt 2025-10-28 19:07 ` Ezekiel Newren 2025-10-28 19:40 ` Junio C Hamano @ 2025-10-29 0:36 ` brian m. carlson 2025-10-29 9:08 ` Patrick Steinhardt 2 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-29 0:36 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 792 bytes --] On 2025-10-28 at 09:17:03, Patrick Steinhardt wrote: > We typically don't have these headers for our C code, so why have it > over here? This is explained in the cover letter. > An alternative to represent this type would be to use an enum: > > pub enum ObjectID { > SHA1([u8; GIT_SHA1_RAWSZ]), > SHA256([u8; GIT_SHA256_RAWSZ]), > } > > That would give us some type safety going forward, but it might be > harder to work with for us? I agree that would be a nicer end state, but that type can't be cast from C, which is something we do later in the series. The goal is to have a type that is suitable for FFI between C and Rust, and we will be able to switch once we have no more C code using this type. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
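To illustrate the casting point: with `#[repr(C)]`, Rust pins down a C-compatible layout, which the plain enum does not guarantee. A sketch, assuming the C `struct object_id` carries a 32-byte hash array followed by a 32-bit algorithm field, as the Rust definition in the patch implies:

    // With #[repr(C)], fields are laid out in declaration order with C
    // alignment rules, so the C struct
    //
    //     struct object_id {
    //         unsigned char hash[GIT_MAX_RAWSZ];
    //         uint32_t algo; /* assuming the series' 32-bit field */
    //     };
    //
    // and the Rust struct below can share pointers freely.
    #[repr(C)]
    pub struct ObjectID {
        pub hash: [u8; 32], // GIT_MAX_RAWSZ
        pub algo: u32,
    }
    // The enum variant carries a compiler-chosen discriminant and layout,
    // so C code could not cast a `struct object_id *` to it; a conversion
    // would be needed at every boundary crossing.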
* Re: [PATCH 04/14] rust: add a ObjectID struct 2025-10-29 0:36 ` brian m. carlson @ 2025-10-29 9:08 ` Patrick Steinhardt 2025-10-30 0:32 ` brian m. carlson 0 siblings, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-29 9:08 UTC (permalink / raw) To: brian m. carlson, git, Junio C Hamano, Ezekiel Newren On Wed, Oct 29, 2025 at 12:36:52AM +0000, brian m. carlson wrote: > On 2025-10-28 at 09:17:03, Patrick Steinhardt wrote: > > We typically don't have these headers for our C code, so why have it > > over here? > > This is explained in the cover letter. > > > An alternative to represent this type would be to use an enum: > > > > pub enum ObjectID { > > SHA1([u8; GIT_SHA1_RAWSZ]), > > SHA256([u8; GIT_SHA256_RAWSZ]), > > } > > > > That would give us some type safety going forward, but it might be > > harder to work with for us? > > I agree that would be a nicer end state, but that type can't be cast from C, > which is something we do later in the series. The goal is to have a type that is > suitable for FFI between C and Rust, and we will be able to switch once > we have no more C code using this type. Fair. I'm mostly asking all of these questions because this is our first Rust code in Git that is a bit more involved. So it's likely that this code will set a precedent for what future code will look like, and ideally I'd like us to end up with idiomatic Rust code. With the FFI code it's of course going to be a mixed bag, as we are somewhat bound by the C interfaces. But in the best case I'd imagine that we have low-level FFI primitives that bridge the gap between C and Rust, and then we build a higher-level interface on top of that which allows us to use it in an idiomatic fashion. I guess all of this will require a lot of iteration anyway as we gain more familiarity with Rust in our codebase. And things don't have to be perfect on the first try *shrug* Thanks! Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
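As a rough sketch of the layering described above, using the C lookup `hash_algo_by_name` declared in the series' hash.h and assuming GIT_HASH_UNKNOWN is 0: a thin extern declaration at the bottom, and a safe, idiomatic entry point on top.

    use std::ffi::CString;
    use std::os::raw::c_char;

    // Low-level FFI primitive: mirrors the C signature exactly.
    extern "C" {
        fn hash_algo_by_name(name: *const c_char) -> u32;
    }

    // Higher-level interface: owns the C-string conversion and turns the
    // sentinel return value into an Option, so no unsafety leaks out.
    pub fn algo_by_name(name: &str) -> Option<HashAlgorithm> {
        let cname = CString::new(name).ok()?;
        HashAlgorithm::from_u32(unsafe { hash_algo_by_name(cname.as_ptr()) })
    }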
* Re: [PATCH 04/14] rust: add a ObjectID struct 2025-10-29 9:08 ` Patrick Steinhardt @ 2025-10-30 0:32 ` brian m. carlson 0 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-30 0:32 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 1932 bytes --] On 2025-10-29 at 09:08:05, Patrick Steinhardt wrote: > I'm mostly asking all of these questions because this is our first Rust > code in Git that is a bit more involved. So it's likely that this code > will set a precedent for what future code will look like, and ideally I'd > like us to end up with idiomatic Rust code. In general, I'd like that, too, and that's a fair question. > With the FFI code it's of course going to be a mixed bag, as we are > somewhat bound by the C interfaces. But in the best case I'd imagine > that we have low-level FFI primitives that bridge the gap between C and > Rust, and then we build a higher-level interface on top of that which > allows us to use it in an idiomatic fashion. The reason I've made the decision to minimize conversions here is that the object ID lookups are in a hot path in `index-pack` and various protocol code. If we clone the Linux repository (in SHA-1) and want to convert it to SHA-256 as part of that clone, we may need to convert every object and then deltify it to write the SHA-256 pack. This is never going to really scream in terms of performance, as you might imagine, but it can be better or worse, and I've tried to make it a little better. Similarly, if we have 500,000 refs on the remote[0], each of those have/want pairs has to be potentially converted and we want people to feel positively about our performance. I will send a patch in a future series that will make this a little more idiomatic on the Rust side as well. > I guess all of this will require a lot of iteration anyway as we gain > more familiarity with Rust in our codebase. And things don't have to be > perfect on the first try *shrug* Yeah, we'll come up with some standards and design guidance as things go along. [0] Some major users of Git do have refs on this order of magnitude. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (3 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 04/14] rust: add a ObjectID struct brian m. carlson @ 2025-10-27 0:43 ` brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 20:00 ` Junio C Hamano 2025-10-27 0:43 ` [PATCH 06/14] hash: add a function to look up hash algo structs brian m. carlson ` (12 subsequent siblings) 17 siblings, 2 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:43 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren This works very similarly to the existing one in C except that it doesn't provide any functionality to hash an object. We don't need that right now, but the use of those function pointers does make it substantially more difficult to write a bit-for-bit identical structure across the C/Rust interface, so omit them for now. Instead of the more customary "&self", use "self", because the former is the size of a pointer and the latter is the size of an integer on most systems. Don't define an unknown value, but use an Option for that instead. Update the object ID structure to allow slicing the data appropriately for the algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- src/hash.rs | 142 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 142 insertions(+) diff --git a/src/hash.rs b/src/hash.rs index 0219391820..1b9f07489e 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -19,3 +19,145 @@ pub struct ObjectID { pub hash: [u8; GIT_MAX_RAWSZ], pub algo: u32, } + +#[allow(dead_code)] +impl ObjectID { + pub fn as_slice(&self) -> &[u8] { + match HashAlgorithm::from_u32(self.algo) { + Some(algo) => &self.hash[0..algo.raw_len()], + None => &self.hash, + } + } + + pub fn as_mut_slice(&mut self) -> &mut [u8] { + match HashAlgorithm::from_u32(self.algo) { + Some(algo) => &mut self.hash[0..algo.raw_len()], + None => &mut self.hash, + } + } +} + +/// A hash algorithm. +#[repr(C)] +#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] +pub enum HashAlgorithm { + SHA1 = 1, + SHA256 = 2, +} + +#[allow(dead_code)] +impl HashAlgorithm { + const SHA1_NULL_OID: ObjectID = ObjectID { + hash: [0u8; 32], + algo: Self::SHA1 as u32, + }; + const SHA256_NULL_OID: ObjectID = ObjectID { + hash: [0u8; 32], + algo: Self::SHA256 as u32, + }; + + const SHA1_EMPTY_TREE: ObjectID = ObjectID { + hash: *b"\x4b\x82\x5d\xc6\x42\xcb\x6e\xb9\xa0\x60\xe5\x4b\xf8\xd6\x92\x88\xfb\xee\x49\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + algo: Self::SHA1 as u32, + }; + const SHA256_EMPTY_TREE: ObjectID = ObjectID { + hash: *b"\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1\x04\xd4\x5d\x8d\x85\xef\xa9\xb0\x57\xb5\x3b\x14\xb4\xb9\xb9\x39\xdd\x74\xde\xcc\x53\x21", + algo: Self::SHA256 as u32, + }; + + const SHA1_EMPTY_BLOB: ObjectID = ObjectID { + hash: *b"\xe6\x9d\xe2\x9b\xb2\xd1\xd6\x43\x4b\x8b\x29\xae\x77\x5a\xd8\xc2\xe4\x8c\x53\x91\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + algo: Self::SHA1 as u32, + }; + const SHA256_EMPTY_BLOB: ObjectID = ObjectID { + hash: *b"\x47\x3a\x0f\x4c\x3b\xe8\xa9\x36\x81\xa2\x67\xe3\xb1\xe9\xa7\xdc\xda\x11\x85\x43\x6f\xe1\x41\xf7\x74\x91\x20\xa3\x03\x72\x18\x13", + algo: Self::SHA256 as u32, + }; + + /// Return a hash algorithm based on the internal integer ID used by Git. + /// + /// Returns `None` if the algorithm doesn't indicate a valid algorithm.
+ pub const fn from_u32(algo: u32) -> Option<HashAlgorithm> { + match algo { + 1 => Some(HashAlgorithm::SHA1), + 2 => Some(HashAlgorithm::SHA256), + _ => None, + } + } + + /// Return a hash algorithm based on the format ID used by Git in binary formats. + /// + /// Returns `None` if the algorithm doesn't indicate a valid algorithm. + pub const fn from_format_id(algo: u32) -> Option<HashAlgorithm> { + match algo { + 0x73686131 => Some(HashAlgorithm::SHA1), + 0x73323536 => Some(HashAlgorithm::SHA256), + _ => None, + } + } + + /// The name of this hash algorithm as a string suitable for the configuration file. + pub const fn name(self) -> &'static str { + match self { + HashAlgorithm::SHA1 => "sha1", + HashAlgorithm::SHA256 => "sha256", + } + } + + /// The format ID of this algorithm for binary formats. + /// + /// Note that when writing this to a data format, it should be written in big-endian format + /// explicitly. + pub const fn format_id(self) -> u32 { + match self { + HashAlgorithm::SHA1 => 0x73686131, + HashAlgorithm::SHA256 => 0x73323536, + } + } + + /// The length of binary object IDs in this algorithm in bytes. + pub const fn raw_len(self) -> usize { + match self { + HashAlgorithm::SHA1 => 20, + HashAlgorithm::SHA256 => 32, + } + } + + /// The length of object IDs in this algorithm in hexadecimal characters. + pub const fn hex_len(self) -> usize { + self.raw_len() * 2 + } + + /// The number of bytes which is processed by one iteration of this algorithm's compression + /// function. + pub const fn block_size(self) -> usize { + match self { + HashAlgorithm::SHA1 => 64, + HashAlgorithm::SHA256 => 64, + } + } + + /// The object ID representing the empty blob. + pub const fn empty_blob(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_EMPTY_BLOB, + HashAlgorithm::SHA256 => &Self::SHA256_EMPTY_BLOB, + } + } + + /// The object ID representing the empty tree. + pub const fn empty_tree(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_EMPTY_TREE, + HashAlgorithm::SHA256 => &Self::SHA256_EMPTY_TREE, + } + } + + /// The object ID which is all zeros. + pub const fn null_oid(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_NULL_OID, + HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, + } + } +} ^ permalink raw reply related [flat|nested] 101+ messages in thread
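A short usage sketch of the abstraction above, assuming the patch's HashAlgorithm is in scope; all of the values come from the patch itself:

    #[test]
    fn hash_algorithm_basics() {
        let algo = HashAlgorithm::from_u32(2).expect("2 is SHA-256");
        assert_eq!(algo.name(), "sha256");
        assert_eq!(algo.raw_len(), 32);
        assert_eq!(algo.hex_len(), 64);
        assert_eq!(algo.format_id(), 0x73323536);
        assert_eq!(HashAlgorithm::from_format_id(0x73686131),
                   Some(HashAlgorithm::SHA1));
    }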
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-27 0:43 ` [PATCH 05/14] rust: add a hash algorithm abstraction brian m. carlson @ 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 17:09 ` Ezekiel Newren 0 siblings, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-28 9:18 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren On Mon, Oct 27, 2025 at 12:43:55AM +0000, brian m. carlson wrote: > diff --git a/src/hash.rs b/src/hash.rs > index 0219391820..1b9f07489e 100644 > --- a/src/hash.rs > +++ b/src/hash.rs > @@ -19,3 +19,145 @@ pub struct ObjectID { > pub hash: [u8; GIT_MAX_RAWSZ], > pub algo: u32, > } > + > +#[allow(dead_code)] > +impl ObjectID { > + pub fn as_slice(&self) -> &[u8] { > + match HashAlgorithm::from_u32(self.algo) { > + Some(algo) => &self.hash[0..algo.raw_len()], > + None => &self.hash, > + } > + } > + > + pub fn as_mut_slice(&mut self) -> &mut [u8] { > + match HashAlgorithm::from_u32(self.algo) { > + Some(algo) => &mut self.hash[0..algo.raw_len()], > + None => &mut self.hash, > + } > + } > +} > + > +/// A hash algorithm. > +#[repr(C)] > +#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] > +pub enum HashAlgorithm { > + SHA1 = 1, > + SHA256 = 2, > +} > + Seeing all the `match` statements: we could alternatively implement this as a trait. This would have the added benefit that we cannot miss updating any of the functions if we ever were to add another hash function. Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
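For reference, a rough sketch of the trait-based shape suggested above (all names hypothetical): each algorithm becomes its own type, and adding a new algorithm forces every method to be implemented rather than every `match` to be extended.

    pub trait Algorithm {
        const RAW_LEN: usize;
        fn name(&self) -> &'static str;
        fn format_id(&self) -> u32;
    }

    pub struct Sha1;
    impl Algorithm for Sha1 {
        const RAW_LEN: usize = 20;
        fn name(&self) -> &'static str { "sha1" }
        fn format_id(&self) -> u32 { 0x73686131 }
    }

    pub struct Sha256;
    impl Algorithm for Sha256 {
        const RAW_LEN: usize = 32;
        fn name(&self) -> &'static str { "sha256" }
        fn format_id(&self) -> u32 { 0x73323536 }
    }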
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-28 17:09 ` Ezekiel Newren 0 siblings, 0 replies; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 17:09 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: brian m. carlson, git, Junio C Hamano On Tue, Oct 28, 2025 at 3:18 AM Patrick Steinhardt <ps@pks.im> wrote: > > On Mon, Oct 27, 2025 at 12:43:55AM +0000, brian m. carlson wrote: > > diff --git a/src/hash.rs b/src/hash.rs > > index 0219391820..1b9f07489e 100644 > > --- a/src/hash.rs > > +++ b/src/hash.rs > > @@ -19,3 +19,145 @@ pub struct ObjectID { > > pub hash: [u8; GIT_MAX_RAWSZ], > > pub algo: u32, > > } > > + > > +#[allow(dead_code)] > > +impl ObjectID { > > + pub fn as_slice(&self) -> &[u8] { > > + match HashAlgorithm::from_u32(self.algo) { > > + Some(algo) => &self.hash[0..algo.raw_len()], > > + None => &self.hash, > > + } > > + } > > + > > + pub fn as_mut_slice(&mut self) -> &mut [u8] { > > + match HashAlgorithm::from_u32(self.algo) { > > + Some(algo) => &mut self.hash[0..algo.raw_len()], > > + None => &mut self.hash, > > + } > > + } > > +} > > + > > +/// A hash algorithm. > > +#[repr(C)] > > +#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] > > +pub enum HashAlgorithm { > > + SHA1 = 1, > > + SHA256 = 2, > > +} > > + > > Seeing all the `match` statements: we could alternatively implement this > as a trait. This would have the added benefit that we cannot miss > updating any of the functions if we ever were to add another hash > function. match is stricter than switch. If another enum variant is added, the current code will not compile. While I do like the idea of using traits, the problem is that the hash algorithm used needs to be known on disk. We can still use traits, but in conjunction with this enum. The part where we need to be careful is HashAlgorithm::from_u32() because if _3_ ever becomes valid then this code (currently) will say it's not. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-27 0:43 ` [PATCH 05/14] rust: add a hash algorithm abstraction brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-28 20:00 ` Junio C Hamano 2025-10-28 20:03 ` Ezekiel Newren 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-28 20:00 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > +#[allow(dead_code)] > +impl ObjectID { > + pub fn as_slice(&self) -> &[u8] { > + match HashAlgorithm::from_u32(self.algo) { > + Some(algo) => &self.hash[0..algo.raw_len()], > + None => &self.hash, > + } > + } > + > + pub fn as_mut_slice(&mut self) -> &mut [u8] { > + match HashAlgorithm::from_u32(self.algo) { > + Some(algo) => &mut self.hash[0..algo.raw_len()], > + None => &mut self.hash, > + } > + } > +} These cases for "None" surprised me a bit; I would have expected us to error out when given an algorithm we do not recognise. > + /// Return a hash algorithm based on the internal integer ID used by Git. > + /// > + /// Returns `None` if the algorithm doesn't indicate a valid algorithm. > + pub const fn from_u32(algo: u32) -> Option<HashAlgorithm> { > + match algo { > + 1 => Some(HashAlgorithm::SHA1), > + 2 => Some(HashAlgorithm::SHA256), > + _ => None, > + } > + } > + > + /// Return a hash algorithm based on the format ID used by Git in binary formats. > + /// > + /// Returns `None` if the algorithm doesn't indicate a valid algorithm. > + pub const fn from_format_id(algo: u32) -> Option<HashAlgorithm> { > + match algo { > + 0x73686131 => Some(HashAlgorithm::SHA1), > + 0x73323536 => Some(HashAlgorithm::SHA256), > + _ => None, > + } > + } > + /// The number of bytes which is processed by one iteration of this algorithm's compression > + /// function. > + pub const fn block_size(self) -> usize { > + match self { > + HashAlgorithm::SHA1 => 64, > + HashAlgorithm::SHA256 => 64, > + } > + } What we see in this patch seems to be a fairly complete rewrite of what we have in <hash.h>. I totally forgot that we had this "block size" there, which is only used in receive-pack.c when we compute the push certificate. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-28 20:00 ` Junio C Hamano @ 2025-10-28 20:03 ` Ezekiel Newren 2025-10-29 13:27 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 20:03 UTC (permalink / raw) To: Junio C Hamano; +Cc: brian m. carlson, git, Patrick Steinhardt On Tue, Oct 28, 2025 at 2:00 PM Junio C Hamano <gitster@pobox.com> wrote: > > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > +#[allow(dead_code)] > > +impl ObjectID { > > + pub fn as_slice(&self) -> &[u8] { > > + match HashAlgorithm::from_u32(self.algo) { > > + Some(algo) => &self.hash[0..algo.raw_len()], > > + None => &self.hash, > > + } > > + } > > + > > + pub fn as_mut_slice(&mut self) -> &mut [u8] { > > + match HashAlgorithm::from_u32(self.algo) { > > + Some(algo) => &mut self.hash[0..algo.raw_len()], > > + None => &mut self.hash, > > + } > > + } > > +} > > These cases for "None" surprised me a bit; I would have expected us > to error out when given an algorithm we do not recognise. I think _Result_ would be more appropriate here. ^ permalink raw reply [flat|nested] 101+ messages in thread
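A small sketch of the Result-returning variant suggested above, assuming the patch's ObjectID and HashAlgorithm are in scope and using a hypothetical error type for the unrecognized-algorithm case:

    // Hypothetical error type, purely for illustration.
    #[derive(Debug)]
    pub struct UnknownAlgorithm(pub u32);

    impl ObjectID {
        pub fn try_as_slice(&self) -> Result<&[u8], UnknownAlgorithm> {
            match HashAlgorithm::from_u32(self.algo) {
                Some(algo) => Ok(&self.hash[0..algo.raw_len()]),
                None => Err(UnknownAlgorithm(self.algo)),
            }
        }
    }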
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-28 20:03 ` Ezekiel Newren @ 2025-10-29 13:27 ` Junio C Hamano 2025-10-29 14:32 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 13:27 UTC (permalink / raw) To: Ezekiel Newren; +Cc: brian m. carlson, git, Patrick Steinhardt Ezekiel Newren <ezekielnewren@gmail.com> writes: >> > +impl ObjectID { >> > + pub fn as_slice(&self) -> &[u8] { >> > + match HashAlgorithm::from_u32(self.algo) { >> > + Some(algo) => &self.hash[0..algo.raw_len()], >> > + None => &self.hash, >> > + } >> > + } >> > + >> > + pub fn as_mut_slice(&mut self) -> &mut [u8] { >> > + match HashAlgorithm::from_u32(self.algo) { >> > + Some(algo) => &mut self.hash[0..algo.raw_len()], >> > + None => &mut self.hash, >> > + } >> > + } >> > +} >> >> These cases for "None" surprised me a bit; I would have expected us >> to error out when given an algorithm we do not recognise. > > I think _Result_ would be more appropriate here. Perhaps. But the Option/Result was not what I was surprised about. When algo is available, we gave back a slice that is properly sized, but when algo is not, I would have expected it to say "nope", instead of yielding the full area of memory available. That was the part I was surprised about. Perhaps the as_mut_slice() side is justifiable (an uninitialized instance of ObjectID is filled by getting the full self.hash and filling it, plus filling the algo), but the same explanation would not apply on the read-only side. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-29 13:27 ` Junio C Hamano @ 2025-10-29 14:32 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 14:32 UTC (permalink / raw) To: Ezekiel Newren; +Cc: brian m. carlson, git, Patrick Steinhardt Junio C Hamano <gitster@pobox.com> writes: >>> These cases for "None" surprised me a bit; I would have expected us >>> to error out when given an algorithm we do not recognise. >> >> I think _Result_ would be more appropriate here. > > Perhaps. But the Option/Result was not what I was surprised about. > ... > Perhaps the as_mut_slice() side is justifiable (an uninitialized > instance of ObjectID is filled by getting the full self.hash and > filling it, plus filling the algo), but the same explanation would > not apply on the read-only side. Rethinking, I guess the "why doesn't it fail in the None case?" is exactly the same question as "why Option, not Result?" as you suggested. Sorry for the noise. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 06/14] hash: add a function to look up hash algo structs 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (4 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 05/14] rust: add a hash algorithm abstraction brian m. carlson @ 2025-10-27 0:43 ` brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 20:12 ` Junio C Hamano 2025-10-27 0:43 ` [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t brian m. carlson ` (11 subsequent siblings) 17 siblings, 2 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:43 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren In C, it's easy for us to look up a hash algorithm structure by its offset by simply indexing the hash_algos array. However, in Rust, we sometimes need a pointer to pass to a C function, but we have our own hash algorithm abstraction. To get one from the other, let's provide a simple function that looks up the C structure from the offset and expose it in Rust. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- hash.c | 5 +++++ hash.h | 1 + src/hash.rs | 15 +++++++++++++++ 3 files changed, 21 insertions(+) diff --git a/hash.c b/hash.c index 81b4f87027..2f4e88e501 100644 --- a/hash.c +++ b/hash.c @@ -241,6 +241,11 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop) return oid_to_hex_r(buf, algop->empty_tree); } +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) +{ + return &hash_algos[algo]; +} + uint32_t hash_algo_by_name(const char *name) { if (!name) diff --git a/hash.h b/hash.h index 99c9c2a0a8..c47ac81989 100644 --- a/hash.h +++ b/hash.h @@ -340,6 +340,7 @@ static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx ctx->algop->final_oid_fn(oid, ctx); } +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo); /* * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if * the name doesn't match a known algorithm. diff --git a/src/hash.rs b/src/hash.rs index 1b9f07489e..a5b9493bd8 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -10,6 +10,8 @@ // You should have received a copy of the GNU General Public License along // with this program; if not, see <https://www.gnu.org/licenses/>. +use std::os::raw::c_void; + pub const GIT_MAX_RAWSZ: usize = 32; /// A binary object ID. @@ -160,4 +162,17 @@ impl HashAlgorithm { HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, } } + + /// A pointer to the C `struct git_hash_algo` for interoperability with C. + pub fn hash_algo_ptr(self) -> *const c_void { + unsafe { c::hash_algo_ptr_by_offset(self as u32) } + } +} + +pub mod c { + use std::os::raw::c_void; + + extern "C" { + pub fn hash_algo_ptr_by_offset(n: u32) -> *const c_void; + } } ^ permalink raw reply related [flat|nested] 101+ messages in thread
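A one-line usage sketch of the new helper from the Rust side, using the names from the patch; note that such a test can only run once the test binary is linked against libgit.a, which a later patch in the series arranges:

    #[test]
    fn sha256_algo_ptr_is_non_null() {
        // Fetch the C `struct git_hash_algo *` for SHA-256 as an opaque
        // pointer, ready to hand to C functions that expect the struct.
        let ptr = HashAlgorithm::SHA256.hash_algo_ptr();
        assert!(!ptr.is_null());
    }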
* Re: [PATCH 06/14] hash: add a function to look up hash algo structs 2025-10-27 0:43 ` [PATCH 06/14] hash: add a function to look up hash algo structs brian m. carlson @ 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 20:12 ` Junio C Hamano 1 sibling, 0 replies; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-28 9:18 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren On Mon, Oct 27, 2025 at 12:43:56AM +0000, brian m. carlson wrote: > diff --git a/hash.c b/hash.c > index 81b4f87027..2f4e88e501 100644 > --- a/hash.c > +++ b/hash.c > @@ -241,6 +241,11 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop) > return oid_to_hex_r(buf, algop->empty_tree); > } > > +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) > +{ > + return &hash_algos[algo]; > +} I think we should have some safety mechanisms here to verify that we don't cause an out-of-bounds access. > diff --git a/src/hash.rs b/src/hash.rs > index 1b9f07489e..a5b9493bd8 100644 > --- a/src/hash.rs > +++ b/src/hash.rs > @@ -160,4 +162,17 @@ impl HashAlgorithm { > HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, > } > } > + > + /// A pointer to the C `struct git_hash_algo` for interoperability with C. > + pub fn hash_algo_ptr(self) -> *const c_void { > + unsafe { c::hash_algo_ptr_by_offset(self as u32) } > + } > +} > + > +pub mod c { > + use std::os::raw::c_void; > + > + extern "C" { > + pub fn hash_algo_ptr_by_offset(n: u32) -> *const c_void; > + } > } I guess eventually we should replace such declarations via bindgen. If so, we could also pull in the `struct git_hash_algo` declaration and have the function return that structure instead of a void pointer. Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 06/14] hash: add a function to look up hash algo structs 2025-10-27 0:43 ` [PATCH 06/14] hash: add a function to look up hash algo structs brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-28 20:12 ` Junio C Hamano 2025-11-04 1:48 ` brian m. carlson 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-28 20:12 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > In C, it's easy for us to look up a hash algorithm structure by its > offset by simply indexing the hash_algos array. However, in Rust, we > sometimes need a pointer to pass to a C function, but we have our own > hash algorithm abstraction. > > To get one from the other, let's provide a simple function that looks up > the C structure from the offset and expose it in Rust. > > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> > --- > hash.c | 5 +++++ > hash.h | 1 + > src/hash.rs | 15 +++++++++++++++ > 3 files changed, 21 insertions(+) > > diff --git a/hash.c b/hash.c > index 81b4f87027..2f4e88e501 100644 > --- a/hash.c > +++ b/hash.c > @@ -241,6 +241,11 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop) > return oid_to_hex_r(buf, algop->empty_tree); > } > > +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) > +{ > + return &hash_algos[algo]; > +} Hmph, technically "algo" may be an "offset" into the array, but I'd consider it an implementation detail. We have hash_algo instances floating somewhere in-core, and have a way to obtain a pointer to one of these instances by "algorithm number". For the user of the API, the fact that these instances are stored in contiguous pieces of memory as an array of struct is totally irrelevant. For that reason, I was somewhat repelled by the "by-offset" part of the function name. The next function ... > uint32_t hash_algo_by_name(const char *name) ... calls what it returns "hash_algo", but the "hash_algo" returned by this new function is quite different. One is just the "algorithm number", while the other is "algorithm instance". Perhaps calling both with the same name "hash algo" is the true source of confusing naming of this new function? > +use std::os::raw::c_void; > + > pub const GIT_MAX_RAWSZ: usize = 32; > > /// A binary object ID. > @@ -160,4 +162,17 @@ impl HashAlgorithm { > HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, > } > } > + > + /// A pointer to the C `struct git_hash_algo` for interoperability with C. > + pub fn hash_algo_ptr(self) -> *const c_void { > + unsafe { c::hash_algo_ptr_by_offset(self as u32) } > + } > +} > + > +pub mod c { > + use std::os::raw::c_void; > + > + extern "C" { > + pub fn hash_algo_ptr_by_offset(n: u32) -> *const c_void; > + } > } I am somewhat surprised that we do not expose "struct git_hash_algo" the same way a previous step exposed "struct object_id" in C as "struct ObjectID" in Rust, but instead pass its address as a void pointer. Hopefully the reason for doing so may become apparent as I read further into the series? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 06/14] hash: add a function to look up hash algo structs 2025-10-28 20:12 ` Junio C Hamano @ 2025-11-04 1:48 ` brian m. carlson 2025-11-04 10:24 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-11-04 1:48 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 3162 bytes --] On 2025-10-28 at 20:12:30, Junio C Hamano wrote: > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) > > +{ > > + return &hash_algos[algo]; > > +} > > Hmph, technically "algo" may be an "offset" into the array, but I'd > consider it an implementation detail. We have hash_algo instances > floating somewhere in-core, and have a way to obtain a pointer to > one of these instances by "algorithm number". For the user of the > API, the fact that these instances are stored in contiguous pieces > of memory as an array of struct is totally irrelevant. For that > reason, I was somewhat repelled by the "by-offset" part of the > function name. I fear I don't have a better name. "by_id" is the format ID. I could write "hash_algo_ptr_by_hash_algo" but that seems slightly bizarre and difficult to type. I could do "by_index", but you might have the same objection to that name. Would you like to propose a nicer alternative? > The next function ... > > > uint32_t hash_algo_by_name(const char *name) > > ... calls what it returns "hash_algo", but the "hash_algo" returned > by this new function is quite different. One is just the "algorithm > number", while the other is "algorithm instance". Perhaps calling > both with the same name "hash algo" is the true source of confusing > naming of this new function? Note that the name is "hash_algo_ptr", not "hash_algo". That is, we're explicitly returning a pointer to the structure here. I realize that's slightly hard to notice at first glance, but it was intentional. I had the same thought about using "hash_algo" as you did, and for that reason decided not to create an ambiguous name. > I am somewhat surprised that we do not expose "struct git_hash_algo" > the same way a previous step exposed "struct object_id" in C as > "struct ObjectID" in Rust, but instead pass its address as a void > pointer. Hopefully the reason for doing so may become apparent as I > read further into the series? We're going to replace this with a nicer abstraction in Rust. Since we don't have bindgen or cbindgen yet, it's going to be kind of tricky to deal with the complexities of the structure such that we get it correctly aligned and matching, and we only need to use it when working with C, so we don't bother to write out the details here. I certainly haven't measured, but I think the Rust compiler will be able to better optimize a function like `raw_len` with two explicit possibilities, especially when it's `const`[0], than the C compiler will with reading what could be an arbitrary value out of the `rawsz` member. Because it's const, the compiler absolutely will be able to evaluate the size of anything where the hash algorithm is known at compile time, and the fact that `hex_len` is defined in terms of `raw_len` provides a helpful hint for the compiler as well, in that one is always twice the other. [0] `const` for a function meaning in this case that it can be evaluated at compile time. -- brian m.
carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
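For illustration, a small sketch of what the `const` qualifier discussed above buys, using the names from patch 05: the lengths fold away at compile time and can appear anywhere a constant expression is required, such as an array length.

    // Evaluated entirely at compile time; no runtime lookup.
    const SHA256_RAW: usize = HashAlgorithm::SHA256.raw_len(); // 32
    const SHA256_HEX: usize = HashAlgorithm::SHA256.hex_len(); // 64

    // Usable in constant position, e.g. to size a hex buffer exactly.
    fn hex_buffer() -> [u8; HashAlgorithm::SHA256.hex_len()] {
        [0u8; HashAlgorithm::SHA256.hex_len()]
    }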
* Re: [PATCH 06/14] hash: add a function to look up hash algo structs 2025-11-04 1:48 ` brian m. carlson @ 2025-11-04 10:24 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-11-04 10:24 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: >> > +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) >> > +{ >> > + return &hash_algos[algo]; >> > +} >> >> Hmph, technically "algo" may be an "offset" into the array, but I'd >> consider it an implementation detail. We have hash_algo instances >> floating somewhere in-core, and have a way to obtain a pointer to >> one of these instances by "algorithm number". For the user of the >> API, the fact that these instances are stored in contiguous pieces >> of memory as an array of struct is totally irrelevant. For that >> reason, I was somewhat repelled by the "by-offset" part of the >> function name. > > I fear I don't have a better name. "by_id" is the format ID. I could > write "hash_algo_ptr_by_hash_algo" but that seems slightly bizarre and > difficult to type. I could do "by_index", but you might have the same > objection to that name. Would you like to propose a nicer alternative? const struct git_hash_algo *hash_algo_ptr_by_algo_number(uint32_t algo_num) { return &hash_algos[algo_num]; } Then, ... >> The next function ... >> >> > uint32_t hash_algo_by_name(const char *name) >> >> ... calls what it returns "hash_algo", but the "hash_algo" returned >> by this new function is quite different. One is just the "algorithm >> number", while the other is "algorithm instance". Perhaps calling >> both with the same name "hash algo" is the true source of confusing >> naming of this new function? ... would become uint32_t hash_algo_num_by_name(const char *name) perhaps. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (5 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 06/14] hash: add a function to look up hash algo structs brian m. carlson @ 2025-10-27 0:43 ` brian m. carlson 2025-10-28 17:22 ` Ezekiel Newren 2025-10-27 0:43 ` [PATCH 08/14] write-or-die: add an fsync component for the loose object map brian m. carlson ` (10 subsequent siblings) 17 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:43 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We want to call this code from Rust and ensure that the types are the same for compatibility, which is easiest to do if the type is a fixed size. Since unsigned int is 32 bits on all the platforms we care about, define it as a uint32_t instead. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- csum-file.c | 2 +- csum-file.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/csum-file.c b/csum-file.c index 6e21e3cac8..3d3047c776 100644 --- a/csum-file.c +++ b/csum-file.c @@ -110,7 +110,7 @@ void discard_hashfile(struct hashfile *f) free_hashfile(f); } -void hashwrite(struct hashfile *f, const void *buf, unsigned int count) +void hashwrite(struct hashfile *f, const void *buf, uint32_t count) { while (count) { unsigned left = f->buffer_len - f->offset; diff --git a/csum-file.h b/csum-file.h index 07ae11024a..ecce9d27b0 100644 --- a/csum-file.h +++ b/csum-file.h @@ -63,7 +63,7 @@ void free_hashfile(struct hashfile *f); */ int finalize_hashfile(struct hashfile *, unsigned char *, enum fsync_component, unsigned int); void discard_hashfile(struct hashfile *); -void hashwrite(struct hashfile *, const void *, unsigned int); +void hashwrite(struct hashfile *, const void *, uint32_t); void hashflush(struct hashfile *f); void crc32_begin(struct hashfile *); uint32_t crc32_end(struct hashfile *); ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t 2025-10-27 0:43 ` [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t brian m. carlson @ 2025-10-28 17:22 ` Ezekiel Newren 0 siblings, 0 replies; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 17:22 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Patrick Steinhardt On Sun, Oct 26, 2025 at 6:44 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > > We want to call this code from Rust and ensure that the types are the > same for compatibility, which is easiest to do if the type is a fixed > size. Since unsigned int is 32 bits on all the platforms we care about, > define it as a uint32_t instead. I'm always in favor of converting to unambiguous types. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 08/14] write-or-die: add an fsync component for the loose object map 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (6 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t brian m. carlson @ 2025-10-27 0:43 ` brian m. carlson 2025-10-27 0:43 ` [PATCH 09/14] hash: expose hash context functions to Rust brian m. carlson ` (9 subsequent siblings) 17 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:43 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We'll soon be writing out a loose object map using the hashfile code. Add an fsync component to allow us to handle fsyncing it correctly. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- write-or-die.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/write-or-die.h b/write-or-die.h index 65a5c42a47..8d5ec23e1f 100644 --- a/write-or-die.h +++ b/write-or-die.h @@ -21,6 +21,7 @@ enum fsync_component { FSYNC_COMPONENT_COMMIT_GRAPH = 1 << 3, FSYNC_COMPONENT_INDEX = 1 << 4, FSYNC_COMPONENT_REFERENCE = 1 << 5, + FSYNC_COMPONENT_LOOSE_OBJECT_MAP = 1 << 6, }; #define FSYNC_COMPONENTS_OBJECTS (FSYNC_COMPONENT_LOOSE_OBJECT | \ @@ -44,7 +45,8 @@ enum fsync_component { FSYNC_COMPONENT_PACK_METADATA | \ FSYNC_COMPONENT_COMMIT_GRAPH | \ FSYNC_COMPONENT_INDEX | \ - FSYNC_COMPONENT_REFERENCE) + FSYNC_COMPONENT_REFERENCE | \ + FSYNC_COMPONENT_LOOSE_OBJECT_MAP) #ifndef FSYNC_COMPONENTS_PLATFORM_DEFAULT #define FSYNC_COMPONENTS_PLATFORM_DEFAULT FSYNC_COMPONENTS_DEFAULT ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH 09/14] hash: expose hash context functions to Rust 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (7 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 08/14] write-or-die: add an fsync component for the loose object map brian m. carlson @ 2025-10-27 0:43 ` brian m. carlson 2025-10-29 16:32 ` Junio C Hamano 2025-10-27 0:44 ` [PATCH 10/14] rust: add a build.rs script for tests brian m. carlson ` (8 subsequent siblings) 17 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:43 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We'd like to be able to hash our data in Rust using the same contexts as in C. However, we need our helper functions to not be inline so they can be linked into the binary appropriately. In addition, to avoid managing memory manually and since we don't know the size of the hash context structure, we want to have simple alloc and free functions we can use to make sure a context can be easily dynamically created. Expose the helper functions and create alloc, free, and init functions we can call. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- hash.c | 35 +++++++++++++++++++++++++++++++++++ hash.h | 27 +++++++-------------------- 2 files changed, 42 insertions(+), 20 deletions(-) diff --git a/hash.c b/hash.c index 2f4e88e501..4977e13de6 100644 --- a/hash.c +++ b/hash.c @@ -246,6 +246,41 @@ const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) return &hash_algos[algo]; } +struct git_hash_ctx *git_hash_alloc(void) +{ + return malloc(sizeof(struct git_hash_ctx)); +} + +void git_hash_free(struct git_hash_ctx *ctx) +{ + free(ctx); +} + +void git_hash_init(struct git_hash_ctx *ctx, const struct git_hash_algo *algop) +{ + algop->init_fn(ctx); +} + +void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src) +{ + src->algop->clone_fn(dst, src); +} + +void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len) +{ + ctx->algop->update_fn(ctx, in, len); +} + +void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx) +{ + ctx->algop->final_fn(hash, ctx); +} + +void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx) +{ + ctx->algop->final_oid_fn(oid, ctx); +} + uint32_t hash_algo_by_name(const char *name) { if (!name) diff --git a/hash.h b/hash.h index c47ac81989..a937b8aff0 100644 --- a/hash.h +++ b/hash.h @@ -320,27 +320,14 @@ struct git_hash_algo { }; extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS]; -static inline void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src) -{ - src->algop->clone_fn(dst, src); -} - -static inline void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len) -{ - ctx->algop->update_fn(ctx, in, len); -} - -static inline void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx) -{ - ctx->algop->final_fn(hash, ctx); -} - -static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx) -{ - ctx->algop->final_oid_fn(oid, ctx); -} - +void git_hash_init(struct git_hash_ctx *ctx, const struct git_hash_algo *algop); +void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src); +void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len); +void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx); +void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx); const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo); 
+struct git_hash_ctx *git_hash_alloc(void); +void git_hash_free(struct git_hash_ctx *ctx); /* * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if * the name doesn't match a known algorithm. ^ permalink raw reply related [flat|nested] 101+ messages in thread
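A hypothetical sketch, not part of the patch, of the kind of Rust-side wrapper these helpers enable: the context stays opaque, and `Drop` guarantees the matching free. The extern signatures are simplified here to void pointers, and the sketch assumes patch 05's HashAlgorithm and patch 06's hash_algo_ptr() are in scope.

    use std::os::raw::c_void;

    extern "C" {
        // Thin bindings to the helpers added above; the context remains
        // opaque on the Rust side, so we never need its size or layout.
        fn git_hash_alloc() -> *mut c_void;
        fn git_hash_free(ctx: *mut c_void);
        fn git_hash_init(ctx: *mut c_void, algop: *const c_void);
        fn git_hash_update(ctx: *mut c_void, data: *const c_void, len: usize);
    }

    pub struct HashContext(*mut c_void);

    impl HashContext {
        pub fn new(algo: HashAlgorithm) -> Self {
            let ctx = unsafe { git_hash_alloc() };
            unsafe { git_hash_init(ctx, algo.hash_algo_ptr()) };
            HashContext(ctx)
        }

        pub fn update(&mut self, data: &[u8]) {
            unsafe {
                git_hash_update(self.0, data.as_ptr() as *const c_void, data.len())
            };
        }
    }

    impl Drop for HashContext {
        fn drop(&mut self) {
            // Freed exactly once, even on early return or panic.
            unsafe { git_hash_free(self.0) }
        }
    }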
* Re: [PATCH 09/14] hash: expose hash context functions to Rust 2025-10-27 0:43 ` [PATCH 09/14] hash: expose hash context functions to Rust brian m. carlson @ 2025-10-29 16:32 ` Junio C Hamano 2025-10-30 21:42 ` brian m. carlson 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 16:32 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > +struct git_hash_ctx *git_hash_alloc(void) > +{ > + return malloc(sizeof(struct git_hash_ctx)); > +} Not an objection, but this looked especially curious to me because it has been customary to use xmalloc() for a thing like this. Going forward, is it our intention that we'd explicitly handle OOM allocation failures ourselves, at least in the Rust part of the code base? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 09/14] hash: expose hash context functions to Rust 2025-10-29 16:32 ` Junio C Hamano @ 2025-10-30 21:42 ` brian m. carlson 2025-10-30 21:52 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-30 21:42 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 733 bytes --] On 2025-10-29 at 16:32:50, Junio C Hamano wrote: > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > +struct git_hash_ctx *git_hash_alloc(void) > > +{ > > + return malloc(sizeof(struct git_hash_ctx)); > > +} > > Not an objection, but this looked especially curious to me because > it has been customary to use xmalloc() for a thing like this. Going > forward, is it our intention that we'd explicitly handle OOM allocation > failures ourselves, at least in the Rust part of the code base? No, I'll change this to use `xmalloc`. Rust handles allocation itself and just panics on OOM, so we will not want to handle allocation failures ourselves. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 09/14] hash: expose hash context functions to Rust 2025-10-30 21:42 ` brian m. carlson @ 2025-10-30 21:52 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-10-30 21:52 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > On 2025-10-29 at 16:32:50, Junio C Hamano wrote: >> "brian m. carlson" <sandals@crustytoothpaste.net> writes: >> >> > +struct git_hash_ctx *git_hash_alloc(void) >> > +{ >> > + return malloc(sizeof(struct git_hash_ctx)); >> > +} >> >> Not an objection, but this looked especially curious to me because >> it has been customary to use xmalloc() for a thing like this. Going >> forward, is it our intention that we'd explicitly handle OOM allocation >> failures ourselves, at least in the Rust part of the code base? > > No, I'll change this to use `xmalloc`. Rust handles allocation itself > and just panics on OOM, so we will not want to handle allocation > failures ourselves. Thanks. And re-reading what I wrote, it does not make much sense, as we would want the integration to go in both directions. I should try hard to get out of this mentality of talking about the C part and the Rust part of the system. What is allocated on one side needs to be able to go to the other side and then come back seamlessly. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 10/14] rust: add a build.rs script for tests 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (8 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 09/14] hash: expose hash context functions to Rust brian m. carlson @ 2025-10-27 0:44 ` brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-29 16:43 ` Junio C Hamano 2025-10-27 0:44 ` [PATCH 11/14] rust: add functionality to hash an object brian m. carlson ` (7 subsequent siblings) 17 siblings, 2 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:44 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren Cargo uses the build.rs script to determine how to compile and link a binary. The only binary we're generating, however, is for our tests; in a future commit, we're going to link against libgit.a for some functionality, and we'll need to make sure the test binaries are complete. Add a build.rs file for this case and specify the files we're going to be linking against. Because we cannot specify different dependencies when building our static library versus our tests, update the Makefile to specify these dependencies for our static library to avoid race conditions during build. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Makefile | 2 +- build.rs | 21 +++++++++++++++++++++ 2 files changed, 22 insertions(+), 1 deletion(-) create mode 100644 build.rs diff --git a/Makefile b/Makefile index 7e5a735ca6..7c36302717 100644 --- a/Makefile +++ b/Makefile @@ -2948,7 +2948,7 @@ scalar$X: scalar.o GIT-LDFLAGS $(GITLIBS) $(LIB_FILE): $(LIB_OBJS) $(QUIET_AR)$(RM) $@ && $(AR) $(ARFLAGS) $@ $^ -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(XDIFF_LIB) $(LIB_FILE) $(REFTABLE_LIB) $(QUIET_CARGO)cargo build $(CARGO_ARGS) .PHONY: rust diff --git a/build.rs b/build.rs new file mode 100644 index 0000000000..136d58c35a --- /dev/null +++ b/build.rs @@ -0,0 +1,21 @@ +// This program is free software; you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation: version 2 of the License, dated June 1991. +// +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License along +// with this program; if not, see <https://www.gnu.org/licenses/>. + +fn main() { + println!("cargo::rustc-link-search=."); + println!("cargo::rustc-link-search=reftable"); + println!("cargo::rustc-link-search=xdiff"); + println!("cargo::rustc-link-lib=git"); + println!("cargo::rustc-link-lib=reftable"); + println!("cargo::rustc-link-lib=z"); + println!("cargo::rustc-link-lib=xdiff"); +} ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-27 0:44 ` [PATCH 10/14] rust: add a build.rs script for tests brian m. carlson @ 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 17:42 ` Ezekiel Newren 1 sibling, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-28 9:18 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren On Mon, Oct 27, 2025 at 12:44:00AM +0000, brian m. carlson wrote: > diff --git a/Makefile b/Makefile > index 7e5a735ca6..7c36302717 100644 > --- a/Makefile > +++ b/Makefile > @@ -2948,7 +2948,7 @@ scalar$X: scalar.o GIT-LDFLAGS $(GITLIBS) > $(LIB_FILE): $(LIB_OBJS) > $(QUIET_AR)$(RM) $@ && $(AR) $(ARFLAGS) $@ $^ > > -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) > +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(XDIFF_LIB) $(LIB_FILE) $(REFTABLE_LIB) > $(QUIET_CARGO)cargo build $(CARGO_ARGS) We have recently removed the separate xdiff and reftable libraries, so it shouldn't be necessary to have these anymore. But one thing I'm curious about: don't we have a circular dependency between the Rust and C library now? I guess that's somewhat expected, as we'll want to call Rust from C and vice versa. But on the Meson side I think we need to adjust our logic so that we don't pull the Rust library into libgit.a to break this cycle. > diff --git a/build.rs b/build.rs > new file mode 100644 > index 0000000000..136d58c35a > --- /dev/null > +++ b/build.rs > @@ -0,0 +1,21 @@ > +// This program is free software; you can redistribute it and/or modify > +// it under the terms of the GNU General Public License as published by > +// the Free Software Foundation: version 2 of the License, dated June 1991. > +// > +// This program is distributed in the hope that it will be useful, > +// but WITHOUT ANY WARRANTY; without even the implied warranty of > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +// GNU General Public License for more details. > +// > +// You should have received a copy of the GNU General Public License along > +// with this program; if not, see <https://www.gnu.org/licenses/>. > + > +fn main() { > + println!("cargo::rustc-link-search=."); > + println!("cargo::rustc-link-search=reftable"); > + println!("cargo::rustc-link-search=xdiff"); > + println!("cargo::rustc-link-lib=git"); > + println!("cargo::rustc-link-lib=reftable"); > + println!("cargo::rustc-link-lib=z"); > + println!("cargo::rustc-link-lib=xdiff"); > +} How do we ensure that the correct libraries are linked here? E.g. for libz, if there are multiple such libraries, which one gets precedence? Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-28 17:42 ` Ezekiel Newren 0 siblings, 0 replies; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 17:42 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: brian m. carlson, git, Junio C Hamano On Tue, Oct 28, 2025 at 3:18 AM Patrick Steinhardt <ps@pks.im> wrote: > > On Mon, Oct 27, 2025 at 12:44:00AM +0000, brian m. carlson wrote: > > diff --git a/Makefile b/Makefile > > index 7e5a735ca6..7c36302717 100644 > > --- a/Makefile > > +++ b/Makefile > > @@ -2948,7 +2948,7 @@ scalar$X: scalar.o GIT-LDFLAGS $(GITLIBS) > > $(LIB_FILE): $(LIB_OBJS) > > $(QUIET_AR)$(RM) $@ && $(AR) $(ARFLAGS) $@ $^ > > > > -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) > > +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(XDIFF_LIB) $(LIB_FILE) $(REFTABLE_LIB) > > $(QUIET_CARGO)cargo build $(CARGO_ARGS) > > We have recently removed the separate xdiff and reftable libraries, so > it shouldn't be necessary to have these anymore. Patrick is referring to my Makefile update libgit.a patch series that has been merged into master [1]. > But one thing I'm curious about: don't we have a circular dependency > between the Rust and C library now? I guess that's somewhat expected, as > we'll want to call Rust from C and vice versa. But on the Meson side I > think we need to adjust our logic so that we don't pull the Rust library > into libgit.a to break this cycle. > > > diff --git a/build.rs b/build.rs > > new file mode 100644 > > index 0000000000..136d58c35a > > --- /dev/null > > +++ b/build.rs > > @@ -0,0 +1,21 @@ > > +// This program is free software; you can redistribute it and/or modify > > +// it under the terms of the GNU General Public License as published by > > +// the Free Software Foundation: version 2 of the License, dated June 1991. > > +// > > +// This program is distributed in the hope that it will be useful, > > +// but WITHOUT ANY WARRANTY; without even the implied warranty of > > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > > +// GNU General Public License for more details. > > +// > > +// You should have received a copy of the GNU General Public License along > > +// with this program; if not, see <https://www.gnu.org/licenses/>. > > + > > +fn main() { > > + println!("cargo::rustc-link-search=."); > > + println!("cargo::rustc-link-search=reftable"); > > + println!("cargo::rustc-link-search=xdiff"); > > + println!("cargo::rustc-link-lib=git"); > > + println!("cargo::rustc-link-lib=reftable"); > > + println!("cargo::rustc-link-lib=z"); > > + println!("cargo::rustc-link-lib=xdiff"); > > +} > > How do we ensure that the correct libraries are linked here? E.g. for > libz, if there are multiple such libraries, which one gets precedence? I solved this problem in my own Introduce Rust series [2,3]. When the Makefile or Meson invokes Cargo, it sets the environment variable `USE_LINKING=false`, and build.rs doesn't link against libgit.a or any other library. When `cargo test` is called, it will link against libgit.a, because an unset USE_LINKING is assumed to be true (roughly as sketched below).
[1] Makefile update libgit.a https://lore.kernel.org/git/pull.2065.v2.git.git.1759447647.gitgitgadget@gmail.com/ [2] Ezekiel's Introduce Rust https://lore.kernel.org/git/6032a8740c0ba72420f42c3d8d801e1bdeec12d0.1758071798.git.gitgitgadget@gmail.com/ [3] Ezekiel's Introduce Rust https://lore.kernel.org/git/6a27e07e6310b6cad0e3feae817269b9b8eaed69.1758071798.git.gitgitgadget@gmail.com/ ^ permalink raw reply [flat|nested] 101+ messages in thread
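A rough sketch of the `USE_LINKING` convention described in the message above, under the assumption that an unset variable means linking is wanted:

    // build.rs -- sketch only. `cargo test` links against libgit.a;
    // Makefile/Meson builds, which drive the final link themselves,
    // export USE_LINKING=false to opt out.
    fn main() {
        let link = std::env::var("USE_LINKING")
            .map(|v| v != "false")
            .unwrap_or(true);
        if link {
            println!("cargo::rustc-link-search=.");
            println!("cargo::rustc-link-lib=git");
            println!("cargo::rustc-link-lib=z");
        }
    }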
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-27 0:44 ` [PATCH 10/14] rust: add a build.rs script for tests brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-29 16:43 ` Junio C Hamano 2025-10-29 22:10 ` Ezekiel Newren 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 16:43 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > Cargo uses the build.rs script to determine how to compile and link a > binary. The only binary we're generating, however, is for our tests; > in a future commit, we're going to link against libgit.a for some > functionality, and we'll need to make sure the test binaries are > complete. OK. > -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) > +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(XDIFF_LIB) $(LIB_FILE) $(REFTABLE_LIB) > $(QUIET_CARGO)cargo build $(CARGO_ARGS) > ... > +fn main() { > + println!("cargo::rustc-link-search=."); > + println!("cargo::rustc-link-search=reftable"); > + println!("cargo::rustc-link-search=xdiff"); > + println!("cargo::rustc-link-lib=git"); > + println!("cargo::rustc-link-lib=reftable"); > + println!("cargo::rustc-link-lib=z"); > + println!("cargo::rustc-link-lib=xdiff"); > +} Hmm, I recall Ezekiel earlier arguing to roll reftable and xdiff libraries into libgit.a as it is a lot more cumbersome to have to link with multiple libraries (sorry, I may be misremembering and do not have a reference handy), but if the above is all it takes to link with these, perhaps it is not such a huge deal? I am a bit confused. XDIFF_LIB and REFTABLE_LIB are gone from Makefile on 'master' already. Perhaps we should revert earlier series from him? Thanks. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-29 16:43 ` Junio C Hamano @ 2025-10-29 22:10 ` Ezekiel Newren 2025-10-29 23:12 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Ezekiel Newren @ 2025-10-29 22:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: brian m. carlson, git, Patrick Steinhardt On Wed, Oct 29, 2025 at 10:43 AM Junio C Hamano <gitster@pobox.com> wrote: > > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > Cargo uses the build.rs script to determine how to compile and link a > > binary. The only binary we're generating, however, is for our tests, > > but in a future commit, we're going to link against libgit.a for some > > functionality and we'll need to make sure the test binaries are > > complete. > > OK. > > > -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) > > +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(XDIFF_LIB) $(LIB_FILE) $(REFTABLE_LIB) > > $(QUIET_CARGO)cargo build $(CARGO_ARGS) > > ... > > +fn main() { > > + println!("cargo::rustc-link-search=."); > > + println!("cargo::rustc-link-search=reftable"); > > + println!("cargo::rustc-link-search=xdiff"); > > + println!("cargo::rustc-link-lib=git"); > > + println!("cargo::rustc-link-lib=reftable"); > > + println!("cargo::rustc-link-lib=z"); > > + println!("cargo::rustc-link-lib=xdiff"); > > +} > > Hmm, I recall Ezekiel earlier arguing to roll reftable and xdiff > libraries into libgit.a as it is a lot more cumbersome to have to > link with multiple libraries (sorry, I may be misremembering and do > not have reference handy), but if the above is all it takes to link > with these, perhaps it is not such a huge deal? I think Brian might have written this before my series was merged in. > I am a bit confused. > > XDIFF_LIB and REFTABLE_LIB are gone from Makefile on 'master' > already. Perhaps we should revert earlier series from him? I don't think we should revert my series. Brian should delete certain lines like so: fn main() { println!("cargo::rustc-link-search=."); - println!("cargo::rustc-link-search=reftable"); - println!("cargo::rustc-link-search=xdiff"); println!("cargo::rustc-link-lib=git"); - println!("cargo::rustc-link-lib=reftable"); println!("cargo::rustc-link-lib=z"); - println!("cargo::rustc-link-lib=xdiff"); } Also the makefile needs to add the flag -fPIC or -fPIE when compiling with Rust. ^ permalink raw reply [flat|nested] 101+ messages in thread
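For clarity, the build.rs that would remain after Ezekiel's suggested deletions would look roughly like this (a sketch derived from the diff above; zlib stays explicitly linked):

fn main() {
    println!("cargo::rustc-link-search=.");
    println!("cargo::rustc-link-lib=git");
    println!("cargo::rustc-link-lib=z");
}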
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-29 22:10 ` Ezekiel Newren @ 2025-10-29 23:12 ` Junio C Hamano 2025-10-30 6:26 ` Patrick Steinhardt 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 23:12 UTC (permalink / raw) To: Ezekiel Newren; +Cc: brian m. carlson, git, Patrick Steinhardt Ezekiel Newren <ezekielnewren@gmail.com> writes: >> Hmm, I recall Ezekiel earlier arguing to roll reftable and xdiff >> libraries into libgit.a as it is a lot more cumbersome to have to >> link with multiple libraries (sorry, I may be misremembering and do >> not have reference handy), but if the above is all it takes to link >> with these, perhaps it is not such a huge deal? > > I think Brian might have written this before my series was merged in. > ... >> I am a bit confused. >> >> XDIFF_LIB and REFTABLE_LIB are gone from Makefile on 'master' >> already. Perhaps we should revert earlier series from him? > ... > I don't think we should revert my series. The order of events does not really matter, does it? If we can happily link with more than one library [*], it would give us a much more pleasant developer experience than having to roll everything into a single library archive, no? Or are you saying that the way this series links these multiple libraries somehow does not work? You somehow managed to confuse me even more ... X-<. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-29 23:12 ` Junio C Hamano @ 2025-10-30 6:26 ` Patrick Steinhardt 2025-10-30 13:54 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-30 6:26 UTC (permalink / raw) To: Junio C Hamano; +Cc: Ezekiel Newren, brian m. carlson, git On Wed, Oct 29, 2025 at 04:12:05PM -0700, Junio C Hamano wrote: > Ezekiel Newren <ezekielnewren@gmail.com> writes: > > >> Hmm, I recall Ezekiel earlier arguing to roll reftable and xdiff > >> libraries into libgit.a as it is a lot more cumbersome to have to > >> link with multiple libraries (sorry, I may be misremembering and do > >> not have reference handy), but if the above is all it takes to link > >> with these, perhaps it is not such a huge deal? > > > > I think Brian might have written this before my series was merged in. > > ... > >> I am a bit confused. > >> > >> XDIFF_LIB and REFTABLE_LIB are gone from Makefile on 'master' > >> already. Perhaps we should revert earlier series from him? > > ... > > I don't think we should revert my series. > > The order of events does not really matter, does it? > > If we can happily link with more than one library [*], it would > give us a much more pleasant developer experience than having to > roll everything into a single library archive, no? Or are you > saying that the way this series links these multiple libraries > somehow does not work? > > You somehow managed to confuse me even more ... X-<. Simplification was only one of the reasons we had. The other reason was to unify how Meson and Makefiles build libgit.a, where the former wasn't ever building separate xdiff and reftable libraries. The question I have here is what the benefit would be to have separate libraries. I don't really see the "more pleasant developer experience", and I'm not really aware of any other benefits. So personally, I'm all for the build system simplification that Ezekiel introduced. Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-30 6:26 ` Patrick Steinhardt @ 2025-10-30 13:54 ` Junio C Hamano 2025-10-31 22:43 ` Ezekiel Newren 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-30 13:54 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: Ezekiel Newren, brian m. carlson, git Patrick Steinhardt <ps@pks.im> writes: > The question I have here is what the benefit would be to have separate > libraries. Mostly flexibility. If we do not value it, then that is OK, though. And personally I would have to say that "meson rolled everything into a single library archive" is a bad excuse---whatever came later doing things differently from the incumbent has to have a good reason to do things differently, or it is a regression. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-30 13:54 ` Junio C Hamano @ 2025-10-31 22:43 ` Ezekiel Newren 2025-11-01 11:18 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Ezekiel Newren @ 2025-10-31 22:43 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Steinhardt, brian m. carlson, git On Thu, Oct 30, 2025 at 7:54 AM Junio C Hamano <gitster@pobox.com> wrote: > > Patrick Steinhardt <ps@pks.im> writes: > > > The question I have here is what the benefit would be to have separate > > libraries. > > Mostly flexibility. If we do not value it, then that is OK, though. > > And personally I would have to say that "meson rolled everything > into a single library archive" is a bad excuse---whatever came later > doing things differently from the incumbent has to have a good reason > to do things differently, or it is a regression. I don't understand why "Simplify Cargo's job of linking with the build systems of Makefile and Meson" isn't a good enough reason by itself. Nor do I understand why having libxdiff.a and libreftable.a produces a better developer experience. My developer experience has been strictly worse because of this separation. If we keep Makefile the way that it was and change Meson to also produce separate static libraries then we'll need to keep 3 build systems in sync with each other. If we roll everything into libgit.a then Cargo only ever needs to know about that static library, Meson doesn't change, and there's no question about where new object files should be added in Makefile. If we do add a 3rd conceptual standalone library then we'd only need to add the source files to Makefile and Meson, but if we insist on separate static libraries then we'll have to add the source files (as usual) and make sure that Makefile, Meson, and Cargo are all in agreement about the static libraries being produced. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-31 22:43 ` Ezekiel Newren @ 2025-11-01 11:18 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-11-01 11:18 UTC (permalink / raw) To: Ezekiel Newren; +Cc: Patrick Steinhardt, brian m. carlson, git Ezekiel Newren <ezekielnewren@gmail.com> writes: >> Mostly flexibility. If we do not value it, then that is OK, though. >> >> And personally I would have to say that "meson rolled everything >> into a single library archive" is a bad excuse---whatever came later >> doing things differently from the incumbent has to have a good reason >> to do things differently, or it is a regression. > > I don't understand why "Simplify Cargo's job of linking with the build > systems of Makefile and Meson" isn't a good enough reason by itself. Was that the way it was sold, though? The motivation is to simplify Rust's job of linking against the C code by requiring it to only link against a single static library (libgit.a). was how the original cover letter sold the change. In addition, in a later thread, I saw this: Like the previous two commits; This one continues the effort to get the Rust compiler to link against libgit.a. Meson already includes the reftable in its libgit.a, but Makefile does not. It led me into (incorrectly) thinking that the Rust toolchain you are using for your series becomes very cumbersome, if not impossible, to use, if we try to have it use more than one library. My job as the project lead would have been to decide if maintaining the separation of three independent libraries was worth the hassle. In other words, I read it as "We have to make do with a single library, due to limitations of Rust build infrastructure, and that is why we are merging logically three separate libraries into one in the build structure in the Makefile. Meson based build happens to already roll everything into one library, so we do not have to do anything extra to implement this workaround for Rust. Only Makefile side needs this change." If I knew that dealing with just one library was not a requirement placed by Rust (and apparently, what brian did in the series under discussion shows that it is not), I would have instead suggested to fix the Meson based build procedure, as I do agree with the idea of "simplifying" to avoid having to deal with 1 with Meson while 3 with Makefile. But I would have suggested to link the same set of three libraries on both sides. The fact I was (mis)led into thinking that the only way to do so is to roll objects from three logically independent libraries into one (due to a limitation in building the Rust part of the code), when the other way, namely, to keep them separate also in Meson based builds, was also perfectly adequate because there is no such limitation placed by Rust, is mostly what makes me react unnecessarily strongly. Yes, I am upset. When there is no strong reason to be different for a newly introduced thing (that is, Meson relative to Makefile), it should avoid being different to avoid breaking expectations (e.g., we'd have this and that .a files left in the build directory to link with objects to produce "git"). So "I do not understand why keeping three is good" is not an argument. The Meson based build series needed to justify why rolling everything into one library was a good idea, but it seems nobody noticed the distinction back then when it was introduced, and you do not have to be retroactively defending that mistake now.
The same goes for position independent code generation (I do not know if it hurts performance very much these days, but it used to introduce a measurable hit, so the benefit needs to outweigh the cost). In any case, it has been a sufficiently long time since we lost the other two libraries in our build, so changing it back to use three separate libraries would be yet another breaking move that I do not want to see---unfortunately it is way too late for that. So brian's patch in this series may need to be rebased to a newer base to expect a single library, I think. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 11/14] rust: add functionality to hash an object 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (9 preceding siblings ...) 2025-10-27 0:44 ` [PATCH 10/14] rust: add a build.rs script for tests brian m. carlson @ 2025-10-27 0:44 ` brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 18:05 ` Ezekiel Newren 2025-10-27 0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson ` (6 subsequent siblings) 17 siblings, 2 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:44 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren In a future commit, we'll want to hash some data when dealing with a loose object map. Let's make this easy by creating a structure to hash objects and calling into the C functions as necessary to perform the hashing. For now, we only implement safe hashing, but in the future we could add unsafe hashing if we want. Implement Clone and Drop to appropriately manage our memory. Additionally implement Write to make it easy to use with other formats that implement this trait. While we're at it, add some tests for the various cases in this file. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- src/hash.rs | 157 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 157 insertions(+) diff --git a/src/hash.rs b/src/hash.rs index a5b9493bd8..8798a50aef 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -10,6 +10,7 @@ // You should have received a copy of the GNU General Public License along // with this program; if not, see <https://www.gnu.org/licenses/>. +use std::io::{self, Write}; use std::os::raw::c_void; pub const GIT_MAX_RAWSZ: usize = 32; @@ -39,6 +40,81 @@ impl ObjectID { } } +pub struct Hasher { + algo: HashAlgorithm, + safe: bool, + ctx: *mut c_void, +} + +impl Hasher { + /// Create a new safe hasher. + pub fn new(algo: HashAlgorithm) -> Hasher { + let ctx = unsafe { c::git_hash_alloc() }; + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; + Hasher { + algo, + safe: true, + ctx, + } + } + + /// Return whether this is a safe hasher. + pub fn is_safe(&self) -> bool { + self.safe + } + + /// Update the hasher with the specified data. + pub fn update(&mut self, data: &[u8]) { + unsafe { c::git_hash_update(self.ctx, data.as_ptr() as *const c_void, data.len()) }; + } + + /// Return an object ID, consuming the hasher. + pub fn into_oid(self) -> ObjectID { + let mut oid = ObjectID { + hash: [0u8; 32], + algo: self.algo as u32, + }; + unsafe { c::git_hash_final_oid(&mut oid as *mut ObjectID as *mut c_void, self.ctx) }; + oid + } + + /// Return a hash as a `Vec`, consuming the hasher. 
+ pub fn into_vec(self) -> Vec<u8> { + let mut v = vec![0u8; self.algo.raw_len()]; + unsafe { c::git_hash_final(v.as_mut_ptr(), self.ctx) }; + v + } +} + +impl Write for Hasher { + fn write(&mut self, data: &[u8]) -> io::Result<usize> { + self.update(data); + Ok(data.len()) + } + + fn flush(&mut self) -> io::Result<()> { + Ok(()) + } +} + +impl Clone for Hasher { + fn clone(&self) -> Hasher { + let ctx = unsafe { c::git_hash_alloc() }; + unsafe { c::git_hash_clone(ctx, self.ctx) }; + Hasher { + algo: self.algo, + safe: self.safe, + ctx, + } + } +} + +impl Drop for Hasher { + fn drop(&mut self) { + unsafe { c::git_hash_free(self.ctx) }; + } +} + /// A hash algorithm, #[repr(C)] #[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] @@ -167,6 +243,11 @@ impl HashAlgorithm { pub fn hash_algo_ptr(self) -> *const c_void { unsafe { c::hash_algo_ptr_by_offset(self as u32) } } + + /// Create a hasher for this algorithm. + pub fn hasher(self) -> Hasher { + Hasher::new(self) + } } pub mod c { @@ -174,5 +255,81 @@ pub mod c { extern "C" { pub fn hash_algo_ptr_by_offset(n: u32) -> *const c_void; + pub fn unsafe_hash_algo(algop: *const c_void) -> *const c_void; + pub fn git_hash_alloc() -> *mut c_void; + pub fn git_hash_free(ctx: *mut c_void); + pub fn git_hash_init(dst: *mut c_void, algop: *const c_void); + pub fn git_hash_clone(dst: *mut c_void, src: *const c_void); + pub fn git_hash_update(ctx: *mut c_void, inp: *const c_void, len: usize); + pub fn git_hash_final(hash: *mut u8, ctx: *mut c_void); + pub fn git_hash_final_oid(hash: *mut c_void, ctx: *mut c_void); + } +} + +#[cfg(test)] +mod tests { + use super::{HashAlgorithm, ObjectID}; + use std::io::Write; + + fn all_algos() -> &'static [HashAlgorithm] { + &[HashAlgorithm::SHA1, HashAlgorithm::SHA256] + } + + #[test] + fn format_id_round_trips() { + for algo in all_algos() { + assert_eq!( + *algo, + HashAlgorithm::from_format_id(algo.format_id()).unwrap() + ); + } + } + + #[test] + fn offset_round_trips() { + for algo in all_algos() { + assert_eq!(*algo, HashAlgorithm::from_u32(*algo as u32).unwrap()); + } + } + + #[test] + fn slices_have_correct_length() { + for algo in all_algos() { + for oid in [algo.null_oid(), algo.empty_blob(), algo.empty_tree()] { + assert_eq!(oid.as_slice().len(), algo.raw_len()); + } + } + } + + #[test] + fn hasher_works_correctly() { + for algo in all_algos() { + let tests: &[(&[u8], &ObjectID)] = &[ + (b"blob 0\0", algo.empty_blob()), + (b"tree 0\0", algo.empty_tree()), + ]; + for (data, oid) in tests { + let mut h = algo.hasher(); + assert_eq!(h.is_safe(), true); + // Test that this works incrementally. + h.update(&data[0..2]); + h.update(&data[2..]); + + let h2 = h.clone(); + + let actual_oid = h.into_oid(); + assert_eq!(**oid, actual_oid); + + let v = h2.into_vec(); + assert_eq!((*oid).as_slice(), &v); + + let mut h = algo.hasher(); + h.write_all(&data[0..2]).unwrap(); + h.write_all(&data[2..]).unwrap(); + + let actual_oid = h.into_oid(); + assert_eq!(**oid, actual_oid); + } + } } } ^ permalink raw reply related [flat|nested] 101+ messages in thread
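Because the patch implements std::io::Write for Hasher, streaming data into it composes with the standard library. A minimal usage sketch, assuming the crate layout of this series; `hash_stream` is a hypothetical helper, not part of the patch:

use std::io::{self, Read};

use crate::hash::{HashAlgorithm, ObjectID};

/// Stream `input` into a hasher and return the resulting object ID.
fn hash_stream<R: Read>(mut input: R, algo: HashAlgorithm) -> io::Result<ObjectID> {
    let mut hasher = algo.hasher();
    // io::copy drives the Write implementation added by this patch.
    io::copy(&mut input, &mut hasher)?;
    Ok(hasher.into_oid())
}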
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-27 0:44 ` [PATCH 11/14] rust: add functionality to hash an object brian m. carlson @ 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-29 0:53 ` brian m. carlson 2025-10-28 18:05 ` Ezekiel Newren 1 sibling, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-28 9:18 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren On Mon, Oct 27, 2025 at 12:44:01AM +0000, brian m. carlson wrote: > In a future commit, we'll want to hash some data when dealing with a > loose object map. Let's make this easy by creating a structure to hash > objects and calling into the C functions as necessary to perform the > hashing. For now, we only implement safe hashing, but in the future we > could add unsafe hashing if we want. Implement Clone and Drop to > appropriately manage our memory. Additionally implement Write to make > it easy to use with other formats that implement this trait. What exactly do you mean with "safe" and "unsafe" hashing? Also, can't we drop this distinction for now until we have a need for it? > diff --git a/src/hash.rs b/src/hash.rs > index a5b9493bd8..8798a50aef 100644 > --- a/src/hash.rs > +++ b/src/hash.rs > @@ -39,6 +40,81 @@ impl ObjectID { > } > } > > +pub struct Hasher { > + algo: HashAlgorithm, > + safe: bool, > + ctx: *mut c_void, > +} Nit: missing documentation. > +impl Hasher { > + /// Create a new safe hasher. > + pub fn new(algo: HashAlgorithm) -> Hasher { > + let ctx = unsafe { c::git_hash_alloc() }; > + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; I already noticed this in the patch that introduced this, but wouldn't it make sense to expose `git_hash_new()` instead of the combination of `alloc() + init()`? > + Hasher { > + algo, > + safe: true, > + ctx, > + } > + } > + > + /// Return whether this is a safe hasher. > + pub fn is_safe(&self) -> bool { > + self.safe > + } > + > + /// Update the hasher with the specified data. > + pub fn update(&mut self, data: &[u8]) { > + unsafe { c::git_hash_update(self.ctx, data.as_ptr() as *const c_void, data.len()) }; > + } > + > + /// Return an object ID, consuming the hasher. > + pub fn into_oid(self) -> ObjectID { > + let mut oid = ObjectID { > + hash: [0u8; 32], > + algo: self.algo as u32, > + }; > + unsafe { c::git_hash_final_oid(&mut oid as *mut ObjectID as *mut c_void, self.ctx) }; > + oid > + } > + > + /// Return a hash as a `Vec`, consuming the hasher. > + pub fn into_vec(self) -> Vec<u8> { > + let mut v = vec![0u8; self.algo.raw_len()]; > + unsafe { c::git_hash_final(v.as_mut_ptr(), self.ctx) }; > + v > + } > +} > + > +impl Write for Hasher { > + fn write(&mut self, data: &[u8]) -> io::Result<usize> { > + self.update(data); > + Ok(data.len()) > + } > + > + fn flush(&mut self) -> io::Result<()> { > + Ok(()) > + } > +} Yup, sensible to implement this interface. > +impl Clone for Hasher { > + fn clone(&self) -> Hasher { > + let ctx = unsafe { c::git_hash_alloc() }; > + unsafe { c::git_hash_clone(ctx, self.ctx) }; > + Hasher { > + algo: self.algo, > + safe: self.safe, > + ctx, > + } > + } > +} Makes sense. > +impl Drop for Hasher { > + fn drop(&mut self) { > + unsafe { c::git_hash_free(self.ctx) }; > + } > +} Likewise. Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-29 0:53 ` brian m. carlson 2025-10-29 9:07 ` Patrick Steinhardt 0 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-29 0:53 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 2153 bytes --] On 2025-10-28 at 09:18:26, Patrick Steinhardt wrote: > On Mon, Oct 27, 2025 at 12:44:01AM +0000, brian m. carlson wrote: > > In a future commit, we'll want to hash some data when dealing with a > > loose object map. Let's make this easy by creating a structure to hash > > objects and calling into the C functions as necessary to perform the > > hashing. For now, we only implement safe hashing, but in the future we > > could add unsafe hashing if we want. Implement Clone and Drop to > > appropriately manage our memory. Additionally implement Write to make > > it easy to use with other formats that implement this trait. > > What exactly do you mean with "safe" and "unsafe" hashing? Also, can't > we drop this distinction for now until we have a need for it? It's from the series that Taylor introduced. For SHA-1, safe hashing (the default) uses SHA-1-DC, but unsafe hashing, which does not operate on untrusted data (say, when we're writing a packfile we've created), may use a faster algorithm. See `git_hash_sha1_init_unsafe`. I can omit the `safe` attribute until we need it, sure. > > diff --git a/src/hash.rs b/src/hash.rs > > index a5b9493bd8..8798a50aef 100644 > > --- a/src/hash.rs > > +++ b/src/hash.rs > > @@ -39,6 +40,81 @@ impl ObjectID { > > } > > } > > > > +pub struct Hasher { > > + algo: HashAlgorithm, > > + safe: bool, > > + ctx: *mut c_void, > > +} > > Nit: missing documentation. Will fix in v2. > > +impl Hasher { > > + /// Create a new safe hasher. > > + pub fn new(algo: HashAlgorithm) -> Hasher { > > + let ctx = unsafe { c::git_hash_alloc() }; > > + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; > > I already noticed this in the patch that introduced this, but wouldn't > it make sense to expose `git_hash_new()` instead of the combination of > `alloc() + init()`? The benefit to this approach is that it allows us to reset a state in the future if we want. If we don't think that's necessary, I can certainly switch to `git_hash_new` if we prefer. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
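If the unsafe variant brian describes were added later, one possible shape is sketched below against the extern declarations already in this patch; `new_unsafe` is hypothetical, and `unsafe_hash_algo` is the binding that maps an algorithm to its faster, non-collision-detecting counterpart:

impl Hasher {
    /// Create a hasher that skips collision detection; callers must feed
    /// it only trusted data.
    pub fn new_unsafe(algo: HashAlgorithm) -> Hasher {
        let ctx = unsafe { c::git_hash_alloc() };
        // Swap in the faster variant without SHA-1-DC collision detection.
        let algop = unsafe { c::unsafe_hash_algo(algo.hash_algo_ptr()) };
        unsafe { c::git_hash_init(ctx, algop) };
        Hasher { algo, safe: false, ctx }
    }
}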
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-29 0:53 ` brian m. carlson @ 2025-10-29 9:07 ` Patrick Steinhardt 0 siblings, 0 replies; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-29 9:07 UTC (permalink / raw) To: brian m. carlson, git, Junio C Hamano, Ezekiel Newren On Wed, Oct 29, 2025 at 12:53:20AM +0000, brian m. carlson wrote: > On 2025-10-28 at 09:18:26, Patrick Steinhardt wrote: > > On Mon, Oct 27, 2025 at 12:44:01AM +0000, brian m. carlson wrote: > > > In a future commit, we'll want to hash some data when dealing with a > > > loose object map. Let's make this easy by creating a structure to hash > > > objects and calling into the C functions as necessary to perform the > > > hashing. For now, we only implement safe hashing, but in the future we > > > could add unsafe hashing if we want. Implement Clone and Drop to > > > appropriately manage our memory. Additionally implement Write to make > > > it easy to use with other formats that implement this trait. > > > > What exactly do you mean with "safe" and "unsafe" hashing? Also, can't > > we drop this distinction for now until we have a need for it? > > It's from the series that Taylor introduced. For SHA-1, safe hashing > (the default) uses SHA-1-DC, but unsafe hashing, which does not operate > on untrusted data (say, when we're writing a packfile we've created), > may use a faster algorithm. See `git_hash_sha1_init_unsafe`. > > I can omit the `safe` attribute until we need it, sure. Ah, I completely forgot about that distinction! Makes sense. > > > +impl Hasher { > > > + /// Create a new safe hasher. > > > + pub fn new(algo: HashAlgorithm) -> Hasher { > > > + let ctx = unsafe { c::git_hash_alloc() }; > > > + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; > > > > I already noticed this in the patch that introduced this, but wouldn't > > it make sense to expose `git_hash_new()` instead of the combination of > > `alloc() + init()`? > > The benefit to this approach is that it allows us to reset a state in > the future if we want. If we don't think that's necessary, I can > certainly switch to `git_hash_new` if we prefer. Hm, fair. I don't mind it much either way. Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
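The reset use case brian alludes to could look roughly like this; a sketch only, since the series exposes alloc() and init() separately but defines no such method:

impl Hasher {
    /// Re-initialize this context in place, discarding anything hashed so far.
    pub fn reset(&mut self) {
        // With separate alloc() and init(), the existing C-side allocation
        // is reused rather than freed and reallocated.
        unsafe { c::git_hash_init(self.ctx, self.algo.hash_algo_ptr()) };
    }
}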
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-27 0:44 ` [PATCH 11/14] rust: add functionality to hash an object brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-28 18:05 ` Ezekiel Newren 2025-10-29 1:05 ` brian m. carlson 1 sibling, 1 reply; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 18:05 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Patrick Steinhardt On Sun, Oct 26, 2025 at 6:44 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > > In a future commit, we'll want to hash some data when dealing with a > loose object map. Let's make this easy by creating a structure to hash > objects and calling into the C functions as necessary to perform the > hashing. For now, we only implement safe hashing, but in the future we > could add unsafe hashing if we want. Implement Clone and Drop to > appropriately manage our memory. Additionally implement Write to make > it easy to use with other formats that implement this trait. > > While we're at it, add some tests for the various cases in this file. > > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> > --- > src/hash.rs | 157 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 157 insertions(+) > > diff --git a/src/hash.rs b/src/hash.rs > index a5b9493bd8..8798a50aef 100644 > --- a/src/hash.rs > +++ b/src/hash.rs > @@ -10,6 +10,7 @@ > // You should have received a copy of the GNU General Public License along > // with this program; if not, see <https://www.gnu.org/licenses/>. > > +use std::io::{self, Write}; > use std::os::raw::c_void; > > pub const GIT_MAX_RAWSZ: usize = 32; > @@ -39,6 +40,81 @@ impl ObjectID { > } > } > > +pub struct Hasher { > + algo: HashAlgorithm, > + safe: bool, > + ctx: *mut c_void, > +} The name _Hasher_ is already used by std::hash::Hasher. It would be preferable to pick a different name to avoid confusion. Perhaps CryptoHasher, SecureHasher? > +impl Hasher { > + /// Create a new safe hasher. > + pub fn new(algo: HashAlgorithm) -> Hasher { > + let ctx = unsafe { c::git_hash_alloc() }; > + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; > + Hasher { > + algo, > + safe: true, > + ctx, > + } > + } - pub fn new(algo: HashAlgorithm) -> Hasher { + pub fn new(algo: HashAlgorithm) -> Self { let ctx = unsafe { c::git_hash_alloc() }; unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; - Hasher { + Self { algo, safe: true, ctx, } > + /// Return whether this is a safe hasher. > + pub fn is_safe(&self) -> bool { > + self.safe > + } I don't understand the point in being able to query whether a given hasher is safe or not. How does that change how this hasher code is used? If the functions are safe then you wouldn't wrap it in an unsafe block. If the functions are declared with unsafe then you'd always need to wrap it in an unsafe block whether it's actually safe or not. Using unsafe in Rust isn't like error handling where you do something different on failure. If something fails in unsafe it's usually unrecoverable, e.g. a segfault due to invalid memory access. My understanding is that unsafe in Rust means "The compiler can't verify that this code is actually safe to run, so I've made sure that it is safe myself and I'll let the compiler know what code to ignore during compilation." > + /// Update the hasher with the specified data.
> + pub fn update(&mut self, data: &[u8]) { > + unsafe { c::git_hash_update(self.ctx, data.as_ptr() as *const c_void, data.len()) }; > + } > + > + /// Return an object ID, consuming the hasher. > + pub fn into_oid(self) -> ObjectID { > + let mut oid = ObjectID { > + hash: [0u8; 32], > + algo: self.algo as u32, > + }; > + unsafe { c::git_hash_final_oid(&mut oid as *mut ObjectID as *mut c_void, self.ctx) }; > + oid > + } > + > + /// Return a hash as a `Vec`, consuming the hasher. > + pub fn into_vec(self) -> Vec<u8> { > + let mut v = vec![0u8; self.algo.raw_len()]; > + unsafe { c::git_hash_final(v.as_mut_ptr(), self.ctx) }; > + v > + } > +} > + > +impl Write for Hasher { > + fn write(&mut self, data: &[u8]) -> io::Result<usize> { > + self.update(data); > + Ok(data.len()) > + } > + > + fn flush(&mut self) -> io::Result<()> { > + Ok(()) > + } > +} > + > +impl Clone for Hasher { > + fn clone(&self) -> Hasher { > + let ctx = unsafe { c::git_hash_alloc() }; > + unsafe { c::git_hash_clone(ctx, self.ctx) }; > + Hasher { > + algo: self.algo, > + safe: self.safe, > + ctx, > + } > + } > +} > + > +impl Drop for Hasher { > + fn drop(&mut self) { > + unsafe { c::git_hash_free(self.ctx) }; > + } > +} Make sense. > /// A hash algorithm, > #[repr(C)] > #[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] > @@ -167,6 +243,11 @@ impl HashAlgorithm { > pub fn hash_algo_ptr(self) -> *const c_void { > unsafe { c::hash_algo_ptr_by_offset(self as u32) } > } > + > + /// Create a hasher for this algorithm. > + pub fn hasher(self) -> Hasher { > + Hasher::new(self) > + } > } > > pub mod c { > @@ -174,5 +255,81 @@ pub mod c { > > extern "C" { > pub fn hash_algo_ptr_by_offset(n: u32) -> *const c_void; > + pub fn unsafe_hash_algo(algop: *const c_void) -> *const c_void; > + pub fn git_hash_alloc() -> *mut c_void; > + pub fn git_hash_free(ctx: *mut c_void); > + pub fn git_hash_init(dst: *mut c_void, algop: *const c_void); > + pub fn git_hash_clone(dst: *mut c_void, src: *const c_void); > + pub fn git_hash_update(ctx: *mut c_void, inp: *const c_void, len: usize); > + pub fn git_hash_final(hash: *mut u8, ctx: *mut c_void); > + pub fn git_hash_final_oid(hash: *mut c_void, ctx: *mut c_void); > + } > +} > + > +#[cfg(test)] > +mod tests { > + use super::{HashAlgorithm, ObjectID}; > + use std::io::Write; > + > + fn all_algos() -> &'static [HashAlgorithm] { > + &[HashAlgorithm::SHA1, HashAlgorithm::SHA256] > + } > + > + #[test] > + fn format_id_round_trips() { > + for algo in all_algos() { > + assert_eq!( > + *algo, > + HashAlgorithm::from_format_id(algo.format_id()).unwrap() > + ); > + } > + } > + > + #[test] > + fn offset_round_trips() { > + for algo in all_algos() { > + assert_eq!(*algo, HashAlgorithm::from_u32(*algo as u32).unwrap()); > + } > + } > + > + #[test] > + fn slices_have_correct_length() { > + for algo in all_algos() { > + for oid in [algo.null_oid(), algo.empty_blob(), algo.empty_tree()] { > + assert_eq!(oid.as_slice().len(), algo.raw_len()); > + } > + } > + } > + > + #[test] > + fn hasher_works_correctly() { > + for algo in all_algos() { > + let tests: &[(&[u8], &ObjectID)] = &[ > + (b"blob 0\0", algo.empty_blob()), > + (b"tree 0\0", algo.empty_tree()), > + ]; > + for (data, oid) in tests { > + let mut h = algo.hasher(); > + assert_eq!(h.is_safe(), true); > + // Test that this works incrementally. 
> + h.update(&data[0..2]); > + h.update(&data[2..]); > + > + let h2 = h.clone(); > + > + let actual_oid = h.into_oid(); > + assert_eq!(**oid, actual_oid); > + > + let v = h2.into_vec(); > + assert_eq!((*oid).as_slice(), &v); > + > + let mut h = algo.hasher(); > + h.write_all(&data[0..2]).unwrap(); > + h.write_all(&data[2..]).unwrap(); > + > + let actual_oid = h.into_oid(); > + assert_eq!(**oid, actual_oid); > + } > + } > } > } Looks good. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-28 18:05 ` Ezekiel Newren @ 2025-10-29 1:05 ` brian m. carlson 2025-10-29 16:02 ` Ben Knoble 0 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-29 1:05 UTC (permalink / raw) To: Ezekiel Newren; +Cc: git, Junio C Hamano, Patrick Steinhardt [-- Attachment #1: Type: text/plain, Size: 2422 bytes --] On 2025-10-28 at 18:05:59, Ezekiel Newren wrote: > The name _Hasher_ is already used by std::hash::Hasher. It would be > preferable to pick a different name to avoid confusion. Perhaps > CryptoHasher, SecureHasher? Sure, I can pick a different name if you like. There are also myriad `Result` values in Rust: `std::result::Result`, `std::fmt::Result`, `std::io::Result`, etc., so I don't see a huge problem with it, but as I said, I can change it if folks prefer. > I don't understand the point in being able to query whether a given > hasher is safe or not. How does that change how this hasher code is > used? If the functions are safe then you wouldn't wrap it in an unsafe > block. If the functions are declared with unsafe then you'd always > need to wrap it in an unsafe block whether it's actually safe or not. > Using unsafe in Rust isn't like error handling where you do something > different on failure. If something fails in unsafe it's usually > unrecoverable, e.g. a segfault due to invalid memory access. My > understanding is that unsafe in Rust means "The compiler can't verify that > this code is actually safe to run, so I've made sure that it is safe > myself and I'll let the compiler know what code to ignore during > compilation." This is not like `unsafe` in Rust. We have some SHA-1 functions that are safe (the default ones) that use SHA-1-DC to detect collisions. People may also compile their Git version with a faster version of SHA-1 that doesn't detect collisions and that may use hardware acceleration in cases where we're not dealing with untrusted data. Taylor benchmarked it and got some pretty nice performance improvements. My preference personally was to simply say, "SHA-1 is slow since it's insecure; use SHA-256 if you want hardware acceleration and good performance," but my advice was not heeded. So this allows us to do something like `assert!(hash.is_safe())` in certain code where we know we have untrusted data to make sure we haven't been passed a Hasher that has been incorrectly initialized. We have some code paths which can accept either (and, depending on which mode they're operating in, do or don't need a safe hasher), so separate types are less convenient. We could do that, however, but it would make things more complicated and we'd need a trait that covers both. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
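The guard brian describes might look like this at the top of a routine that must only see collision-detecting hashing (`index_untrusted_data` is a hypothetical caller, not code from the series):

fn index_untrusted_data(hasher: &mut Hasher, data: &[u8]) {
    // Untrusted input must go through the collision-detecting code path.
    assert!(hasher.is_safe());
    hasher.update(data);
}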
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-29 1:05 ` brian m. carlson @ 2025-10-29 16:02 ` Ben Knoble 0 siblings, 0 replies; 101+ messages in thread From: Ben Knoble @ 2025-10-29 16:02 UTC (permalink / raw) To: brian m. carlson; +Cc: Ezekiel Newren, git, Junio C Hamano, Patrick Steinhardt > Le 28 oct. 2025 à 21:06, brian m. carlson <sandals@crustytoothpaste.net> a écrit : > > On 2025-10-28 at 18:05:59, Ezekiel Newren wrote: >> The name _Hasher_ is already used by std::hash::Hasher. It would be >> preferable to pick a different name to avoid confusion. Perhaps >> CryptoHasher, SecureHasher? > > Sure, I can pick a different name if you like. There are also myriad > `Result` values in Rust: `std::result::Result`, `std::fmt::Result`, > `std::io::Result`, etc., so I don't see a huge problem with it, but as I > said, I can change it if folks prefer. > >> I don't understand the point in being able to query whether a given >> hasher is safe or not. How does that change how this hasher code is >> used? If the functions are safe then you wouldn't wrap it in an unsafe >> block. If the functions are declared with unsafe then you'd always >> need to wrap it in an unsafe block whether it's actually safe or not. >> Using unsafe in Rust isn't like error handling where you do something >> different on failure. If something fails in unsafe it's usually >> unrecoverable, e.g. a segfault due to invalid memory access. My >> understanding is that unsafe in Rust means "The compiler can't verify that >> this code is actually safe to run, so I've made sure that it is safe >> myself and I'll let the compiler know what code to ignore during >> compilation." > > This is not like `unsafe` in Rust. We have some SHA-1 functions that > are safe (the default ones) that use SHA-1-DC to detect collisions. > People may also compile their Git version with a faster version of SHA-1 > that doesn't detect collisions and that may use hardware acceleration in > cases where we're not dealing with untrusted data. Taylor benchmarked > it and got some pretty nice performance improvements. > > My preference personally was to simply say, "SHA-1 is slow since it's > insecure; use SHA-256 if you want hardware acceleration and good > performance," but my advice was not heeded. > > So this allows us to do something like `assert!(hash.is_safe())` in > certain code where we know we have untrusted data to make sure we > haven't been passed a Hasher that has been incorrectly initialized. We > have some code paths which can accept either (and, depending on which > mode they're operating in, do or don't need a safe hasher), so separate > types are less convenient. We could do that, however, but it would make > things more complicated and we'd need a trait that covers both. > -- > brian m. carlson (they/them) > Toronto, Ontario, CA > <signature.asc> Given the confusion on the names, perhaps some docs in the code would help? Or maybe it's already doc'd over by the FFI type, in which case a note may suffice: "Safe" here is about the hashing algorithm and (un)trusted data, not Rust memory safety. See XYZ for more details. ^ permalink raw reply [flat|nested] 101+ messages in thread
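Ben's suggested note could land as a doc comment on the struct itself; a sketch only, with wording adapted from his message:

/// A streaming hasher over Git's C hash-context functions.
///
/// "Safe" here is about the hashing algorithm and (un)trusted data, not
/// Rust memory safety: the safe SHA-1 implementation uses SHA-1-DC to
/// detect collisions, while the unsafe one may trade that for speed.
pub struct Hasher {
    algo: HashAlgorithm,
    safe: bool,
    ctx: *mut c_void,
}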
* [PATCH 12/14] rust: add a new binary loose object map format 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (10 preceding siblings ...) 2025-10-27 0:44 ` [PATCH 11/14] rust: add functionality to hash an object brian m. carlson @ 2025-10-27 0:44 ` brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt ` (2 more replies) 2025-10-27 0:44 ` [PATCH 13/14] rust: add a small wrapper around the hashfile code brian m. carlson ` (5 subsequent siblings) 17 siblings, 3 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:44 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren Our current loose object format has a few problems. First, it is not efficient: the list of object IDs is not sorted and even if it were, there would not be an efficient way to look up objects in both algorithms. Second, we need to store mappings for things which are not technically loose objects but are not packed objects, either, and so cannot be stored in a pack index. These kinds of things include shallows, their parents, and their trees, as well as submodules. Yet we also need to implement a sensible way to store the kind of object so that we can prune unneeded entries. For instance, if the user has updated the shallows, we can remove the old values. For these reasons, introduce a new binary loose object map format. The careful reader will notice that it resembles very closely the pack index v3 format. Add an in-memory loose object map as well, and allow enabling writing to a batched map, which can then be written later as one of the binary loose object maps. Include several tests for round tripping and data lookup across algorithms. Note that the use of this code elsewhere in Git will involve some C code and some C-compatible code in Rust that will be introduced in a future commit. Thus, for example, we ignore the fact that if there is no current batch and the caller asks for data to be written, this code does nothing, mostly because this code also does not involve itself with opening or manipulating files. The C code that we will add later will implement this functionality at a higher level and take care of this, since the code which is necessary for writing to the object store is deeply involved with our C abstractions and it would require extensive work (which would not be especially valuable at this point) to port those to Rust. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Documentation/gitformat-loose.adoc | 104 ++++ Makefile | 1 + src/lib.rs | 1 + src/loose.rs | 912 +++++++++++++++++++++++++++++ src/meson.build | 1 + 5 files changed, 1019 insertions(+) create mode 100644 src/loose.rs diff --git a/Documentation/gitformat-loose.adoc b/Documentation/gitformat-loose.adoc index 947993663e..4850c91669 100644 --- a/Documentation/gitformat-loose.adoc +++ b/Documentation/gitformat-loose.adoc @@ -10,6 +10,8 @@ SYNOPSIS -------- [verse] $GIT_DIR/objects/[0-9a-f][0-9a-f]/* +$GIT_DIR/objects/loose-object-idx +$GIT_DIR/objects/loose-map/map-*.map DESCRIPTION ----------- @@ -48,6 +50,108 @@ stored under Similarly, a blob containing the contents `abc` would have the uncompressed data of `blob 3\0abc`. +== Loose object mapping + +When the `compatObjectFormat` option is used, Git needs to store a mapping +between the repository's main algorithm and the compatibility algorithm. There +are two formats for this: the legacy mapping and the modern mapping. 
+ +=== Legacy mapping + +The compatibility mapping is stored in a file called +`$GIT_DIR/objects/loose-object-idx`. The format of this file looks like this: + + # loose-object-idx + (main-name SP compat-name LF)* + +`main-name` refers to the hexadecimal object ID of the object in the main +repository format and `compat-name` refers to the same thing, but for the +compatibility format. + +This format is read if it exists but is not written. + +Note that carriage returns are not permitted in this file, regardless of the +host system or configuration. + +=== Modern mapping + +The modern mapping consists of a set of files under `$GIT_DIR/objects/loose-map` +ending in `.map`. The portion of the filename before the extension is the +hash checksum in hex format. + +`git pack-objects` will repack existing entries into one file, removing any +unnecessary objects, such as obsolete shallow entries or loose objects that +have been packed. + +==== Mapping file format + +- A header appears at the beginning and consists of the following: + * A 4-byte mapping signature: `LMAP` + * 4-byte version number: 1 + * 4-byte length of the header section. + * 4-byte number of objects declared in this map file. + * 4-byte number of object formats declared in this map file. + * For each object format: + ** 4-byte format identifier (e.g., `sha1` for SHA-1) + ** 4-byte length in bytes of shortened object names. This is the + shortest possible length needed to make names in the shortened + object name table unambiguous. + ** 8-byte integer, recording where tables relating to this format + are stored in this index file, as an offset from the beginning. + * 8-byte offset to the trailer from the beginning of this file. + * Zero or more additional key/value pairs (4-byte key, 4-byte value), which + may optionally declare one or more chunks. No chunks are currently + defined. Readers must ignore unrecognized keys. +- Zero or more NUL bytes. These are used to improve the alignment of the + 4-byte quantities below. +- Tables for the first object format: + * A sorted table of shortened object names. These are prefixes of the names + of all objects in this file, packed together without offset values to + reduce the cache footprint of the binary search for a specific object name. + * A sorted table of full object names. + * A table of 4-byte metadata values. + * Zero or more chunks. A chunk starts with a four-byte chunk identifier and + a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte + size (not including the identifier, parameter, or size), plus the chunk + data. +- Zero or more NUL bytes. +- Tables for subsequent object formats: + * A sorted table of shortened object names. These are prefixes of the names + of all objects in this file, packed together without offset values to + reduce the cache footprint of the binary search for a specific object name. + * A table of full object names in the order specified by the first object format. + * A table of 4-byte values mapping object name order to the order of the + first object format. For an object in the table of sorted shortened object + names, the value at the corresponding index in this table is the index in + the previous table for that same object. + * Zero or more NUL bytes. +- The trailer consists of the following: + * Hash checksum of all of the above. + +The lower six bits of each metadata value contain a type field indicating the +reason that this object is stored: + +0:: + Reserved.
+1:: + This object is stored as a loose object in the repository. +2:: + This object is a shallow entry. The mapping refers to a shallow value + returned by a remote server. +3:: + This object is a submodule entry. The mapping refers to the commit stored + representing a submodule. + +Other data may be stored in this field in the future. Bits that are not used +must be zero. + +All 4-byte numbers are in network order and must be 4-byte aligned in the file, +so the NUL padding may be required in some cases. + +Note that the hash at the end of the file is in whatever the repository's main +algorithm is. In the usual case when there are multiple algorithms, the main +algorithm will be SHA-256 and the compatibility algorithm will be SHA-1. + GIT --- Part of the linkgit:git[1] suite diff --git a/Makefile b/Makefile index 7c36302717..2081b13780 100644 --- a/Makefile +++ b/Makefile @@ -1523,6 +1523,7 @@ UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o RUST_SOURCES += src/hash.rs RUST_SOURCES += src/lib.rs +RUST_SOURCES += src/loose.rs RUST_SOURCES += src/varint.rs GIT-VERSION-FILE: FORCE diff --git a/src/lib.rs b/src/lib.rs index cf7c962509..442f9433dc 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1,2 +1,3 @@ pub mod hash; +pub mod loose; pub mod varint; diff --git a/src/loose.rs b/src/loose.rs new file mode 100644 index 0000000000..a4e7d2fa48 --- /dev/null +++ b/src/loose.rs @@ -0,0 +1,912 @@ +// This program is free software; you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation: version 2 of the License, dated June 1991. +// +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License along +// with this program; if not, see <https://www.gnu.org/licenses/>. + +use crate::hash::{HashAlgorithm, ObjectID, GIT_MAX_RAWSZ}; +use std::collections::BTreeMap; +use std::convert::TryInto; +use std::io::{self, Write}; + +/// The type of object stored in the map. +/// +/// If this value is `Reserved`, then it is never written to disk and is used primarily to store +/// certain hard-coded objects, like the empty tree, empty blob, or null object ID. +/// +/// If this value is `LooseObject`, then this represents a loose object. `Shallow` represents a +/// shallow commit, its parent, or its tree. `Submodule` represents a submodule commit. +#[repr(C)] +#[derive(Debug, Clone, Copy, Ord, PartialOrd, Eq, PartialEq)] +pub enum MapType { + Reserved = 0, + LooseObject = 1, + Shallow = 2, + Submodule = 3, +} + +impl MapType { + pub fn from_u32(n: u32) -> Option<MapType> { + match n { + 0 => Some(Self::Reserved), + 1 => Some(Self::LooseObject), + 2 => Some(Self::Shallow), + 3 => Some(Self::Submodule), + _ => None, + } + } +} + +/// The value of an object stored in a `LooseObjectMemoryMap`. +/// +/// This keeps the object ID to which the key is mapped and its kind together. +struct MappedObject { + oid: ObjectID, + kind: MapType, +} + +/// Memory storage for a loose object. +struct LooseObjectMemoryMap { + to_compat: BTreeMap<ObjectID, MappedObject>, + to_storage: BTreeMap<ObjectID, MappedObject>, + compat: HashAlgorithm, + storage: HashAlgorithm, +} + +impl LooseObjectMemoryMap { + /// Create a new `LooseObjectMemoryMap`. 
+ /// + /// The storage and compatibility `HashAlgorithm` instances are used to store the object IDs in + /// the correct map. + fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> LooseObjectMemoryMap { + LooseObjectMemoryMap { + to_compat: BTreeMap::new(), + to_storage: BTreeMap::new(), + compat, + storage, + } + } + + fn len(&self) -> usize { + self.to_compat.len() + } + + /// Write this map to an interface implementing `std::io::Write`. + fn write<W: Write>(&self, wrtr: W) -> io::Result<()> { + const VERSION_NUMBER: u32 = 1; + const NUM_OBJECT_FORMATS: u32 = 2; + const PADDING: [u8; 4] = [0u8; 4]; + + let mut wrtr = wrtr; + let header_size: u32 = 4 + 4 + 4 + 4 + 4 + (4 + 4 + 8) * 2 + 8; + + wrtr.write_all(b"LMAP")?; + wrtr.write_all(&VERSION_NUMBER.to_be_bytes())?; + wrtr.write_all(&header_size.to_be_bytes())?; + wrtr.write_all(&(self.to_compat.len() as u32).to_be_bytes())?; + wrtr.write_all(&NUM_OBJECT_FORMATS.to_be_bytes())?; + + let storage_short_len = self.find_short_name_len(&self.to_compat, self.storage); + let compat_short_len = self.find_short_name_len(&self.to_storage, self.compat); + + let storage_npadding = Self::required_nul_padding(self.to_compat.len(), storage_short_len); + let compat_npadding = Self::required_nul_padding(self.to_compat.len(), compat_short_len); + + let mut offset: u64 = header_size as u64; + + for (algo, len, npadding) in &[ + (self.storage, storage_short_len, storage_npadding), + (self.compat, compat_short_len, compat_npadding), + ] { + wrtr.write_all(&algo.format_id().to_be_bytes())?; + wrtr.write_all(&(*len as u32).to_be_bytes())?; + + offset += *npadding; + wrtr.write_all(&offset.to_be_bytes())?; + + offset += self.to_compat.len() as u64 * (*len as u64 + algo.raw_len() as u64 + 4); + } + + wrtr.write_all(&offset.to_be_bytes())?; + + let order_map: BTreeMap<&ObjectID, usize> = self + .to_compat + .keys() + .enumerate() + .map(|(i, oid)| (oid, i)) + .collect(); + + wrtr.write_all(&PADDING[0..storage_npadding as usize])?; + for oid in self.to_compat.keys() { + wrtr.write_all(&oid.as_slice()[0..storage_short_len])?; + } + for oid in self.to_compat.keys() { + wrtr.write_all(oid.as_slice())?; + } + for meta in self.to_compat.values() { + wrtr.write_all(&(meta.kind as u32).to_be_bytes())?; + } + + wrtr.write_all(&PADDING[0..compat_npadding as usize])?; + for oid in self.to_storage.keys() { + wrtr.write_all(&oid.as_slice()[0..compat_short_len])?; + } + for meta in self.to_compat.values() { + wrtr.write_all(meta.oid.as_slice())?; + } + for meta in self.to_storage.values() { + wrtr.write_all(&(order_map[&meta.oid] as u32).to_be_bytes())?; + } + + Ok(()) + } + + fn required_nul_padding(nitems: usize, short_len: usize) -> u64 { + let shortened_table_len = nitems as u64 * short_len as u64; + let misalignment = shortened_table_len & 3; + // If the value is 0, return 0; otherwise, return the difference from 4. 
+ (4 - misalignment) & 3 + } + + fn last_matching_offset(a: &ObjectID, b: &ObjectID, algop: HashAlgorithm) -> usize { + for i in 0..=algop.raw_len() { + if a.hash[i] != b.hash[i] { + return i; + } + } + algop.raw_len() + } + + fn find_short_name_len( + &self, + map: &BTreeMap<ObjectID, MappedObject>, + algop: HashAlgorithm, + ) -> usize { + if map.len() <= 1 { + return 1; + } + let mut len = 1; + let mut iter = map.keys(); + let mut cur = match iter.next() { + Some(cur) => cur, + None => return len, + }; + for item in iter { + let offset = Self::last_matching_offset(cur, item, algop); + if offset >= len { + len = offset + 1; + } + cur = item; + } + if len > algop.raw_len() { + algop.raw_len() + } else { + len + } + } +} + +struct ObjectFormatData { + data_off: usize, + shortened_len: usize, + full_off: usize, + mapping_off: Option<usize>, +} + +pub struct MmapedLooseObjectMapIter<'a> { + offset: usize, + algos: Vec<HashAlgorithm>, + source: &'a MmapedLooseObjectMap<'a>, +} + +impl<'a> Iterator for MmapedLooseObjectMapIter<'a> { + type Item = Vec<ObjectID>; + + fn next(&mut self) -> Option<Self::Item> { + if self.offset >= self.source.nitems { + return None; + } + let offset = self.offset; + self.offset += 1; + let v: Vec<ObjectID> = self + .algos + .iter() + .cloned() + .filter_map(|algo| self.source.oid_from_offset(offset, algo)) + .collect(); + if v.len() != self.algos.len() { + return None; + } + Some(v) + } +} + +#[allow(dead_code)] +pub struct MmapedLooseObjectMap<'a> { + memory: &'a [u8], + nitems: usize, + meta_off: usize, + obj_formats: BTreeMap<HashAlgorithm, ObjectFormatData>, + main_algo: HashAlgorithm, +} + +#[derive(Debug)] +#[allow(dead_code)] +enum MmapedParseError { + HeaderTooSmall, + InvalidSignature, + InvalidVersion, + UnknownAlgorithm, + OffsetTooLarge, + TooFewObjectFormats, + UnalignedData, + InvalidTrailerOffset, +} + +#[allow(dead_code)] +impl<'a> MmapedLooseObjectMap<'a> { + fn new( + slice: &'a [u8], + hash_algo: HashAlgorithm, + ) -> Result<MmapedLooseObjectMap<'a>, MmapedParseError> { + let object_format_header_size = 4 + 4 + 8; + let trailer_offset_size = 8; + let header_size: usize = + 4 + 4 + 4 + 4 + 4 + object_format_header_size * 2 + trailer_offset_size; + if slice.len() < header_size { + return Err(MmapedParseError::HeaderTooSmall); + } + if slice[0..4] != *b"LMAP" { + return Err(MmapedParseError::InvalidSignature); + } + if Self::u32_at_offset(slice, 4) != 1 { + return Err(MmapedParseError::InvalidVersion); + } + let _ = Self::u32_at_offset(slice, 8) as usize; + let nitems = Self::u32_at_offset(slice, 12) as usize; + let nobj_formats = Self::u32_at_offset(slice, 16) as usize; + if nobj_formats < 2 { + return Err(MmapedParseError::TooFewObjectFormats); + } + let mut offset = 20; + let mut meta_off = None; + let mut data = BTreeMap::new(); + for i in 0..nobj_formats { + if offset + object_format_header_size + trailer_offset_size > slice.len() { + return Err(MmapedParseError::HeaderTooSmall); + } + let format_id = Self::u32_at_offset(slice, offset); + let shortened_len = Self::u32_at_offset(slice, offset + 4) as usize; + let data_off = Self::u64_at_offset(slice, offset + 8); + + let algo = HashAlgorithm::from_format_id(format_id) + .ok_or(MmapedParseError::UnknownAlgorithm)?; + let data_off: usize = data_off + .try_into() + .map_err(|_| MmapedParseError::OffsetTooLarge)?; + + // Every object format must have these entries. 
+ let shortened_table_len = shortened_len + .checked_mul(nitems) + .ok_or(MmapedParseError::OffsetTooLarge)?; + let full_off = data_off + .checked_add(shortened_table_len) + .ok_or(MmapedParseError::OffsetTooLarge)?; + Self::verify_aligned(full_off)?; + Self::verify_valid(slice, full_off as u64)?; + + let full_length = algo + .raw_len() + .checked_mul(nitems) + .ok_or(MmapedParseError::OffsetTooLarge)?; + let off = full_length + .checked_add(full_off) + .ok_or(MmapedParseError::OffsetTooLarge)?; + Self::verify_aligned(off)?; + Self::verify_valid(slice, off as u64)?; + + // This is for the metadata for the first object format and for the order mapping for + // other object formats. + let meta_size = nitems + .checked_mul(4) + .ok_or(MmapedParseError::OffsetTooLarge)?; + let meta_end = off + .checked_add(meta_size) + .ok_or(MmapedParseError::OffsetTooLarge)?; + Self::verify_valid(slice, meta_end as u64)?; + + let mut mapping_off = None; + if i == 0 { + meta_off = Some(off); + } else { + mapping_off = Some(off); + } + + data.insert( + algo, + ObjectFormatData { + data_off, + shortened_len, + full_off, + mapping_off, + }, + ); + offset += object_format_header_size; + } + let trailer = Self::u64_at_offset(slice, offset); + Self::verify_aligned(trailer as usize)?; + Self::verify_valid(slice, trailer)?; + let end = trailer + .checked_add(hash_algo.raw_len() as u64) + .ok_or(MmapedParseError::OffsetTooLarge)?; + if end != slice.len() as u64 { + return Err(MmapedParseError::InvalidTrailerOffset); + } + match meta_off { + Some(meta_off) => Ok(MmapedLooseObjectMap { + memory: slice, + nitems, + meta_off, + obj_formats: data, + main_algo: hash_algo, + }), + None => Err(MmapedParseError::TooFewObjectFormats), + } + } + + fn iter(&self) -> MmapedLooseObjectMapIter<'_> { + let mut algos = Vec::with_capacity(self.obj_formats.len()); + algos.push(self.main_algo); + for algo in self.obj_formats.keys().cloned() { + if algo != self.main_algo { + algos.push(algo); + } + } + MmapedLooseObjectMapIter { + offset: 0, + algos, + source: self, + } + } + + /// Treats `sl` as if it were a set of slices of `wanted.len()` bytes, and searches for + /// `wanted` within it. + /// + /// If found, returns the offset of the subslice in `sl`. + /// + /// ``` + /// let sl = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]; + /// + /// assert_eq!(MmapedLooseObjectMap::binary_search_slice(sl, &[2, 3]), Some(1)); + /// assert_eq!(MmapedLooseObjectMap::binary_search_slice(sl, &[6, 7]), Some(4)); + /// assert_eq!(MmapedLooseObjectMap::binary_search_slice(sl, &[1, 2]), None); + /// assert_eq!(MmapedLooseObjectMap::binary_search_slice(sl, &[10, 20]), None); + /// ``` + fn binary_search_slice(sl: &[u8], wanted: &[u8]) -> Option<usize> { + let len = wanted.len(); + let res = sl.binary_search_by(|item| { + // We would like element_offset, but that is currently nightly only. Instead, do a + // pointer subtraction to find the index. + let index = unsafe { (item as *const u8).offset_from(sl.as_ptr()) } as usize; + // Now we have the index of this object. Round it down to the nearest full-sized + // chunk to find the actual offset where this starts. + let index = index - (index % len); + // Compute the comparison of that value instead, which will provide the expected + // result. + sl[index..index + wanted.len()].cmp(wanted) + }); + res.ok().map(|offset| offset / len) + } + + /// Look up `oid` in the map in order to convert it to `algo`. + /// + /// If this object is in the map, return the offset in the table for the main algorithm. 
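+    ///
+    /// The lookup binary-searches the shortened-name table for the object ID's algorithm,
+    /// follows the order-mapping table when the ID is in a compatibility algorithm, and
+    /// verifies the full object ID before returning, so a shortened-prefix match alone is
+    /// never trusted.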
+ fn look_up_object(&self, oid: &ObjectID) -> Option<usize> { + let oid_algo = HashAlgorithm::from_u32(oid.algo)?; + let params = self.obj_formats.get(&oid_algo)?; + let short_table = + &self.memory[params.data_off..params.data_off + (params.shortened_len * self.nitems)]; + let index = + Self::binary_search_slice(short_table, &oid.as_slice()[0..params.shortened_len])?; + match params.mapping_off { + Some(from_off) => { + // oid is in a compatibility algorithm. Find the mapping index. + let mapped = Self::u32_at_offset(self.memory, from_off + index * 4) as usize; + if mapped >= self.nitems { + return None; + } + let oid_offset = params.full_off + mapped * oid_algo.raw_len(); + if self.memory[oid_offset..oid_offset + oid_algo.raw_len()] != *oid.as_slice() { + return None; + } + Some(mapped) + } + None => { + // oid is in the main algorithm. Find the object ID in the main map to confirm + // it's correct. + let oid_offset = params.full_off + index * oid_algo.raw_len(); + if self.memory[oid_offset..oid_offset + oid_algo.raw_len()] != *oid.as_slice() { + return None; + } + Some(index) + } + } + } + + #[allow(dead_code)] + fn map_object(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<MappedObject> { + let main = self.look_up_object(oid)?; + let meta = MapType::from_u32(Self::u32_at_offset(self.memory, self.meta_off + (main * 4)))?; + Some(MappedObject { + oid: self.oid_from_offset(main, algo)?, + kind: meta, + }) + } + + fn map_oid(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<ObjectID> { + if algo as u32 == oid.algo { + return Some(oid.clone()); + } + + let main = self.look_up_object(oid)?; + self.oid_from_offset(main, algo) + } + + fn oid_from_offset(&self, offset: usize, algo: HashAlgorithm) -> Option<ObjectID> { + let aparams = self.obj_formats.get(&algo)?; + + let mut hash = [0u8; GIT_MAX_RAWSZ]; + let len = algo.raw_len(); + let oid_off = aparams.full_off + (offset * len); + hash[0..len].copy_from_slice(&self.memory[oid_off..oid_off + len]); + Some(ObjectID { + hash, + algo: algo as u32, + }) + } + + fn u32_at_offset(slice: &[u8], offset: usize) -> u32 { + u32::from_be_bytes(slice[offset..offset + 4].try_into().unwrap()) + } + + fn u64_at_offset(slice: &[u8], offset: usize) -> u64 { + u64::from_be_bytes(slice[offset..offset + 8].try_into().unwrap()) + } + + fn verify_aligned(offset: usize) -> Result<(), MmapedParseError> { + if (offset & 3) != 0 { + return Err(MmapedParseError::UnalignedData); + } + Ok(()) + } + + fn verify_valid(slice: &[u8], offset: u64) -> Result<(), MmapedParseError> { + if offset >= slice.len() as u64 { + return Err(MmapedParseError::OffsetTooLarge); + } + Ok(()) + } +} + +/// A map for loose and other non-packed object IDs that maps between a storage and compatibility +/// mapping. +/// +/// In addition to the in-memory option, there is an optional batched storage, which can be used to +/// write objects to disk in an efficient way. +pub struct LooseObjectMap { + mem: LooseObjectMemoryMap, + batch: Option<LooseObjectMemoryMap>, +} + +impl LooseObjectMap { + /// Create a new `LooseObjectMap` with the given hash algorithms. + /// + /// This initializes the memory map to automatically map the empty tree, empty blob, and null + /// object ID. 
+    pub fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> LooseObjectMap {
+        let mut map = LooseObjectMemoryMap::new(storage, compat);
+        for (main, compat) in &[
+            (storage.empty_tree(), compat.empty_tree()),
+            (storage.empty_blob(), compat.empty_blob()),
+            (storage.null_oid(), compat.null_oid()),
+        ] {
+            map.to_storage.insert(
+                (*compat).clone(),
+                MappedObject {
+                    oid: (*main).clone(),
+                    kind: MapType::Reserved,
+                },
+            );
+            map.to_compat.insert(
+                (*main).clone(),
+                MappedObject {
+                    oid: (*compat).clone(),
+                    kind: MapType::Reserved,
+                },
+            );
+        }
+        LooseObjectMap {
+            mem: map,
+            batch: None,
+        }
+    }
+
+    pub fn hash_algo(&self) -> HashAlgorithm {
+        self.mem.storage
+    }
+
+    /// Start a batch for efficient writing.
+    ///
+    /// If there is already a batch started, this does nothing and the existing batch is retained.
+    pub fn start_batch(&mut self) {
+        if self.batch.is_none() {
+            self.batch = Some(LooseObjectMemoryMap::new(self.mem.storage, self.mem.compat));
+        }
+    }
+
+    pub fn batch_len(&self) -> Option<usize> {
+        self.batch.as_ref().map(|b| b.len())
+    }
+
+    /// If a batch exists, write it to the writer.
+    pub fn finish_batch<W: Write>(&mut self, w: W) -> io::Result<()> {
+        if let Some(txn) = self.batch.take() {
+            txn.write(w)?;
+        }
+        Ok(())
+    }
+
+    /// If a batch exists, discard it without writing it anywhere.
+    pub fn abort_batch(&mut self) {
+        self.batch = None;
+    }
+
+    /// Return whether there is a batch already started.
+    ///
+    /// If you just want a batch to exist and don't care whether one has already been started, you
+    /// may simply call `start_batch` unconditionally.
+    pub fn has_batch(&self) -> bool {
+        self.batch.is_some()
+    }
+
+    /// Insert an object into the map.
+    ///
+    /// If `write` is true and there is a batch started, write the object into the batch as well as
+    /// into the memory map.
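+    ///
+    /// The two object IDs may be passed in either order; the implementation checks which one
+    /// uses the compatibility algorithm and files each ID into the appropriate map.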
+ pub fn insert(&mut self, oid1: &ObjectID, oid2: &ObjectID, kind: MapType, write: bool) { + let (compat_oid, storage_oid) = + if HashAlgorithm::from_u32(oid1.algo) == Some(self.mem.compat) { + (oid1, oid2) + } else { + (oid2, oid1) + }; + Self::insert_into(&mut self.mem, storage_oid, compat_oid, kind); + if write { + if let Some(ref mut batch) = self.batch { + Self::insert_into(batch, storage_oid, compat_oid, kind); + } + } + } + + fn insert_into( + map: &mut LooseObjectMemoryMap, + storage: &ObjectID, + compat: &ObjectID, + kind: MapType, + ) { + map.to_compat.insert( + storage.clone(), + MappedObject { + oid: compat.clone(), + kind, + }, + ); + map.to_storage.insert( + compat.clone(), + MappedObject { + oid: storage.clone(), + kind, + }, + ); + } + + #[allow(dead_code)] + fn map_object(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<&MappedObject> { + let map = if algo == self.mem.storage { + &self.mem.to_storage + } else { + &self.mem.to_compat + }; + map.get(oid) + } + + #[allow(dead_code)] + fn map_oid<'a, 'b: 'a>( + &'b self, + oid: &'a ObjectID, + algo: HashAlgorithm, + ) -> Option<&'a ObjectID> { + if algo as u32 == oid.algo { + return Some(oid); + } + let entry = self.map_object(oid, algo); + entry.map(|obj| &obj.oid) + } +} + +#[cfg(test)] +mod tests { + use super::{LooseObjectMap, LooseObjectMemoryMap, MapType, MmapedLooseObjectMap}; + use crate::hash::{HashAlgorithm, Hasher, ObjectID}; + use std::convert::TryInto; + use std::io::{self, Cursor, Write}; + + struct TrailingWriter { + curs: Cursor<Vec<u8>>, + hasher: Hasher, + } + + impl TrailingWriter { + fn new() -> TrailingWriter { + TrailingWriter { + curs: Cursor::new(Vec::new()), + hasher: Hasher::new(HashAlgorithm::SHA256), + } + } + + fn finalize(mut self) -> Vec<u8> { + let _ = self.hasher.flush(); + let mut v = self.curs.into_inner(); + v.extend(self.hasher.into_vec()); + v + } + } + + impl Write for TrailingWriter { + fn write(&mut self, data: &[u8]) -> io::Result<usize> { + self.hasher.write_all(data)?; + self.curs.write_all(data)?; + Ok(data.len()) + } + + fn flush(&mut self) -> io::Result<()> { + self.hasher.flush()?; + self.curs.flush()?; + Ok(()) + } + } + + fn sha1_oid(b: &[u8]) -> ObjectID { + assert_eq!(b.len(), 20); + let mut data = [0u8; 32]; + data[0..20].copy_from_slice(b); + ObjectID { + hash: data, + algo: HashAlgorithm::SHA1 as u32, + } + } + + fn sha256_oid(b: &[u8]) -> ObjectID { + assert_eq!(b.len(), 32); + ObjectID { + hash: b.try_into().unwrap(), + algo: HashAlgorithm::SHA256 as u32, + } + } + + fn test_entries() -> &'static [(&'static str, &'static [u8], &'static [u8], MapType, bool)] { + // These are all example blobs containing the content in the first argument. 
+ &[ + ("abc", b"\xf2\xba\x8f\x84\xab\x5c\x1b\xce\x84\xa7\xb4\x41\xcb\x19\x59\xcf\xc7\x09\x3b\x7f", b"\xc1\xcf\x6e\x46\x50\x77\x93\x0e\x88\xdc\x51\x36\x64\x1d\x40\x2f\x72\xa2\x29\xdd\xd9\x96\xf6\x27\xd6\x0e\x96\x39\xea\xba\x35\xa6", MapType::LooseObject, false), + ("def", b"\x0c\x00\x38\x32\xe7\xbf\xa9\xca\x8b\x5c\x20\x35\xc9\xbd\x68\x4a\x5f\x26\x23\xbc", b"\x8a\x90\x17\x26\x48\x4d\xb0\xf2\x27\x9f\x30\x8d\x58\x96\xd9\x6b\xf6\x3a\xd6\xde\x95\x7c\xa3\x8a\xdc\x33\x61\x68\x03\x6e\xf6\x63", MapType::Shallow, true), + ("ghi", b"\x45\xa8\x2e\x29\x5c\x52\x47\x31\x14\xc5\x7c\x18\xf4\xf5\x23\x68\xdf\x2a\x3c\xfd", b"\x6e\x47\x4c\x74\xf5\xd7\x78\x14\xc7\xf7\xf0\x7c\x37\x80\x07\x90\x53\x42\xaf\x42\x81\xe6\x86\x8d\x33\x46\x45\x4b\xb8\x63\xab\xc3", MapType::Submodule, false), + ("jkl", b"\x45\x32\x8c\x36\xff\x2e\x9b\x9b\x4e\x59\x2c\x84\x7d\x3f\x9a\x7f\xd9\xb3\xe7\x16", b"\xc3\xee\xf7\x54\xa2\x1e\xc6\x9d\x43\x75\xbe\x6f\x18\x47\x89\xa8\x11\x6f\xd9\x66\xfc\x67\xdc\x31\xd2\x11\x15\x42\xc8\xd5\xa0\xaf", MapType::LooseObject, true), + ] + } + + fn test_map(write_all: bool) -> Box<LooseObjectMap> { + let mut map = Box::new(LooseObjectMap::new( + HashAlgorithm::SHA256, + HashAlgorithm::SHA1, + )); + + map.start_batch(); + + for (_blob_content, sha1, sha256, kind, swap) in test_entries() { + let s256 = sha256_oid(sha256); + let s1 = sha1_oid(sha1); + let write = write_all || (*kind as u32 & 2) == 0; + if *swap { + // Insert the item into the batch arbitrarily based on the type. This tests that + // we can specify either order and we'll do the right thing. + map.insert(&s256, &s1, *kind, write); + } else { + map.insert(&s1, &s256, *kind, write); + } + } + + map + } + + #[test] + fn can_read_and_write_format() { + for full in &[true, false] { + let mut map = test_map(*full); + let mut wrtr = TrailingWriter::new(); + map.finish_batch(&mut wrtr).unwrap(); + + assert_eq!(map.has_batch(), false); + + let data = wrtr.finalize(); + MmapedLooseObjectMap::new(&data, HashAlgorithm::SHA256).unwrap(); + } + } + + #[test] + fn looks_up_from_mmaped() { + let mut map = test_map(true); + let mut wrtr = TrailingWriter::new(); + map.finish_batch(&mut wrtr).unwrap(); + + assert_eq!(map.has_batch(), false); + + let data = wrtr.finalize(); + let entries = test_entries(); + let map = MmapedLooseObjectMap::new(&data, HashAlgorithm::SHA256).unwrap(); + + for (_, sha1, sha256, kind, _) in entries { + let s256 = sha256_oid(sha256); + let s1 = sha1_oid(sha1); + + let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res.oid, s1); + assert_eq!(res.kind, *kind); + let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res, s1); + + let res = map.map_object(&s256, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res.oid, s256); + assert_eq!(res.kind, *kind); + let res = map.map_oid(&s256, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res, s256); + + let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res.oid, s256); + assert_eq!(res.kind, *kind); + let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res, s256); + + let res = map.map_object(&s1, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res.oid, s1); + assert_eq!(res.kind, *kind); + let res = map.map_oid(&s1, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res, s1); + } + + for octet in &[0x00u8, 0x6d, 0x6e, 0x8a, 0xff] { + let missing_oid = ObjectID { + hash: [*octet; 32], + algo: HashAlgorithm::SHA256 as u32, + }; + + assert!(map.map_object(&missing_oid, HashAlgorithm::SHA1).is_none()); + 
assert!(map.map_oid(&missing_oid, HashAlgorithm::SHA1).is_none()); + + assert_eq!( + map.map_oid(&missing_oid, HashAlgorithm::SHA256).unwrap(), + missing_oid + ); + } + } + + #[test] + fn binary_searches_slices_correctly() { + let sl = &[ + 0, 1, 2, 15, 14, 13, 18, 10, 2, 20, 20, 20, 21, 21, 0, 21, 21, 1, 21, 21, 21, 21, 21, + 22, 22, 23, 24, + ]; + + let expected: &[(&[u8], Option<usize>)] = &[ + (&[0, 1, 2], Some(0)), + (&[15, 14, 13], Some(1)), + (&[18, 10, 2], Some(2)), + (&[20, 20, 20], Some(3)), + (&[21, 21, 0], Some(4)), + (&[21, 21, 1], Some(5)), + (&[21, 21, 21], Some(6)), + (&[21, 21, 22], Some(7)), + (&[22, 23, 24], Some(8)), + (&[2, 15, 14], None), + (&[0, 21, 21], None), + (&[21, 21, 23], None), + (&[22, 22, 23], None), + (&[0xff, 0xff, 0xff], None), + (&[0, 0, 0], None), + ]; + + for (wanted, value) in expected { + assert_eq!( + MmapedLooseObjectMap::binary_search_slice(sl, wanted), + *value + ); + } + } + + #[test] + fn looks_up_oid_correctly() { + let map = test_map(false); + let entries = test_entries(); + + let s256 = sha256_oid(entries[0].2); + let s1 = sha1_oid(entries[0].1); + + let missing_oid = ObjectID { + hash: [0xffu8; 32], + algo: HashAlgorithm::SHA256 as u32, + }; + + let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res.oid, s1); + assert_eq!(res.kind, MapType::LooseObject); + let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(*res, s1); + + let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res.oid, s256); + assert_eq!(res.kind, MapType::LooseObject); + let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(*res, s256); + + assert!(map.map_object(&missing_oid, HashAlgorithm::SHA1).is_none()); + assert!(map.map_oid(&missing_oid, HashAlgorithm::SHA1).is_none()); + + assert_eq!( + *map.map_oid(&missing_oid, HashAlgorithm::SHA256).unwrap(), + missing_oid + ); + } + + #[test] + fn looks_up_known_oids_correctly() { + let map = test_map(false); + + let funcs: &[&dyn Fn(HashAlgorithm) -> &'static ObjectID] = &[ + &|h: HashAlgorithm| h.empty_tree(), + &|h: HashAlgorithm| h.empty_blob(), + &|h: HashAlgorithm| h.null_oid(), + ]; + + for f in funcs { + let s256 = f(HashAlgorithm::SHA256); + let s1 = f(HashAlgorithm::SHA1); + + let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res.oid, *s1); + assert_eq!(res.kind, MapType::Reserved); + let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(*res, *s1); + + let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res.oid, *s256); + assert_eq!(res.kind, MapType::Reserved); + let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(*res, *s256); + } + } + + #[test] + fn nul_padding() { + assert_eq!(LooseObjectMemoryMap::required_nul_padding(1, 1), 3); + assert_eq!(LooseObjectMemoryMap::required_nul_padding(2, 1), 2); + assert_eq!(LooseObjectMemoryMap::required_nul_padding(3, 1), 1); + assert_eq!(LooseObjectMemoryMap::required_nul_padding(2, 2), 0); + + assert_eq!(LooseObjectMemoryMap::required_nul_padding(39, 3), 3); + } +} diff --git a/src/meson.build b/src/meson.build index c77041a3fa..1eea068519 100644 --- a/src/meson.build +++ b/src/meson.build @@ -1,6 +1,7 @@ libgit_rs_sources = [ 'hash.rs', 'lib.rs', + 'loose.rs', 'varint.rs', ] ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 12/14] rust: add a new binary loose object map format 2025-10-27 0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson @ 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-29 1:37 ` brian m. carlson 2025-10-29 17:03 ` Junio C Hamano 2025-10-29 18:21 ` Junio C Hamano 2 siblings, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-28 9:18 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren On Mon, Oct 27, 2025 at 12:44:02AM +0000, brian m. carlson wrote: > Our current loose object format has a few problems. First, it is not > efficient: the list of object IDs is not sorted and even if it were, > there would not be an efficient way to look up objects in both > algorithms. > > Second, we need to store mappings for things which are not technically > loose objects but are not packed objects, either, and so cannot be > stored in a pack index. These kinds of things include shallows, their > parents, and their trees, as well as submodules. Yet we also need to > implement a sensible way to store the kind of object so that we can > prune unneeded entries. For instance, if the user has updated the > shallows, we can remove the old values. Doesn't this indicate that calling this "loose object map" is kind of a misnomer? If we want to be able to store arbitrary objects regardless of the way those are stored (or not stored) in the ODB then I think it's overall quite confusing to have "loose" in the name. This isn't something we can fix for the old loose object map. But shouldn't we fix this now for the new format you're about to introduce? > For these reasons, introduce a new binary loose object map format. The > careful reader will notice that it resembles very closely the pack index > v3 format. Add an in-memory loose object map as well, and allow > enabling writing to a batched map, which can then be written later as > one of the binary loose object maps. Include several tests for round > tripping and data lookup across algorithms. s/enabling// > diff --git a/Documentation/gitformat-loose.adoc b/Documentation/gitformat-loose.adoc > index 947993663e..4850c91669 100644 > --- a/Documentation/gitformat-loose.adoc > +++ b/Documentation/gitformat-loose.adoc > @@ -48,6 +50,108 @@ stored under > Similarly, a blob containing the contents `abc` would have the uncompressed > data of `blob 3\0abc`. > > +== Loose object mapping > + > +When the `compatObjectFormat` option is used, Git needs to store a mapping > +between the repository's main algorithm and the compatibility algorithm. There > +are two formats for this: the legacy mapping and the modern mapping. > + > +=== Legacy mapping > + > +The compatibility mapping is stored in a file called > +`$GIT_DIR/objects/loose-object-idx`. The format of this file looks like this: > + > + # loose-object-idx > + (main-name SP compat-name LF)* > + > +`main-name` refers to hexadecimal object ID of the object in the main > +repository format and `compat-name` refers to the same thing, but for the > +compatibility format. > + > +This format is read if it exists but is not written. > + > +Note that carriage returns are not permitted in this file, regardless of the > +host system or configuration. As far as I understood, this legacy mapping wasn't really used anywhere as it is basically nonfunctional in the first place. Can we get away with dropping it altogether? > +=== Modern mapping > + > +The modern mapping consists of a set of files under `$GIT_DIR/objects/loose` > +ending in `.map`. 
The portion of the filename before the extension is that of
> +the hash checksum in hex format.

Given that we're talking about multiple different hashes: which hash
function is used for this checksum? I assume it's the main hash, but it
might be sensible to document this.

> +`git pack-objects` will repack existing entries into one file, removing any
> +unnecessary objects, such as obsolete shallow entries or loose objects that
> +have been packed.

Curious that this is put into git-pack-objects(1), as it doesn't quite
feel related to the task. Sure, it generates packfiles, but it doesn't
really handle the logic to manage loose objects/packfiles in the repo.
This feels closer to what git-repack(1) is doing, so would that be a
better place to put it?

> +==== Mapping file format
> +
> +- A header appears at the beginning and consists of the following:
> +  * A 4-byte mapping signature: `LMAP`
> +  * 4-byte version number: 1
> +  * 4-byte length of the header section.
> +  * 4-byte number of objects declared in this map file.
> +  * 4-byte number of object formats declared in this map file.
> +  * For each object format:
> +    ** 4-byte format identifier (e.g., `sha1` for SHA-1)
> +    ** 4-byte length in bytes of shortened object names. This is the
> +       shortest possible length needed to make names in the shortened
> +       object name table unambiguous.
> +    ** 8-byte integer, recording where tables relating to this format
> +       are stored in this index file, as an offset from the beginning.

As far as I understand this allows us to even store multiple
compatibility hashes if we were ever to grow a third hash. We would
still be able to binary-search through the file as we can compute the
size of every record with this header.

> +  * 8-byte offset to the trailer from the beginning of this file.
> +  * Zero or more additional key/value pairs (4-byte key, 4-byte value), which
> +    may optionally declare one or more chunks. No chunks are currently
> +    defined. Readers must ignore unrecognized keys.

How does the reader identify these key/value pairs and know how many of
those there are? Also, do you already have an idea what those should be
used for?

> +- Zero or more NUL bytes. These are used to improve the alignment of the
> +  4-byte quantities below.

How does one figure out how many NUL bytes there's going to be? I guess
the reader doesn't need to know as it simply uses the length of the
header section to seek to the tables?

> +- Tables for the first object format:
> +  * A sorted table of shortened object names. These are prefixes of the names
> +    of all objects in this file, packed together without offset values to
> +    reduce the cache footprint of the binary search for a specific object name.

Okay. The length of the shortened object names is encoded in the header,
so all of the shortened names have the same length.

Does the reader have a way to disambiguate the shortened object names?
They may be unambiguous at the point in time where the mapping is
written, but when they are being shortened it becomes plausible that the
object names become ambiguous at a later point in time.

> +  * A sorted table of full object names.

Ah, I see! We have a second table further down that encodes full object
names, so yes, we can fully disambiguate.

> +  * A table of 4-byte metadata values.
> +  * Zero or more chunks. A chunk starts with a four-byte chunk identifier and
> +    a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte
> +    size (not including the identifier, parameter, or size), plus the chunk
> +    data.
> +- Zero or more NUL bytes. > +- Tables for subsequent object formats: > + * A sorted table of shortened object names. These are prefixes of the names > + of all objects in this file, packed together without offset values to > + reduce the cache footprint of the binary search for a specific object name. > + * A table of full object names in the order specified by the first object format. Interesting, why are these sorted by the first object format again? Doesn't that mean that I have to do a linear search now to locate the entry for the second object format? Disclaimer: the following paragraphs go into how I would have designed this. This is _not_ meant as a "you have to do it this way", but as a discussion starter to figure out why you have picked the proposed format and for me to get a better understanding of it. Stepping back a bit, my expectation is that we'd have one lookup table per object format so that we can map into all directions: SHA1 -> SHA256 and in reverse. If we had more than two hash functions we'd also need to have a table for e.g. Blake3 -> SHA1 and Blake3 -> SHA256 and reverse. One way to do this is to have three tables, one for each object format. The object formats would be ordered lexicographically by their own object ID, so that one can perform a binary search for an object ID in every format. Each row could then either contain all compatibility hashes directly, but this would explode quite fast in storage space. An optimization would thus be to have one table per object format that contains the shortened object ID plus an offset where the actual record can be found. You know where to find the tables from the header, and you know the exact size of each entry, so you can trivially perform a binary search for the abbreviated object ID in that index. Once you've found that index you take the stored offset to look up the record in the "main" table. This main table contains the full object IDs for all object hashes. So something like the following simplified format: +---------------------------------+ | header | | Format version | | Number of object IDs | | SHA1: abbrev, offset | | SHA256: abbrev, offset | | Blake3: abbrev, offset | | Main: offset | +---------------------------------+ | table for SHA1 | | 11111 -> 1 | | 22222 -> 2 | +---------------------------------+ | table for SHA256 | | aaaaa -> 2 | | bbbbb -> 1 | +---------------------------------+ | table for Blake3 | | 88888 -> 2 | | 99999 -> 1 | +---------------------------------+ | main table | | 11111111 -> bbbbbbbb -> 9999999 | | 22222222 -> aaaaaaaa -> 8888888 | +---------------------------------+ | trailer | | trailer hash | +---------------------------------+ Overall you only have to store the full object ID for each hash exactly once, and the mappings also only have to be stored once. But you can look up an ID by each of its formats via its indices. With some slight adjustments one could also adapt this format to become streamable: - The header only contains the format information as well as which hash functions are contained. - The header is followed by the main table. The order of these objects is basically the streaming order, we don't care about it. We also don't have to abbreviate any hashes here. Like this we can stream the mappings to disk one by one, and we only need to remember the specific offsets where each mapping was stored. - Once all mappings have been streamed we can then write the lookup tables. We remember the starting index for each lookup table. 
- The footer contains the number of records stored in the table as well as the individual abbreviated object ID lengths per hash. From that number it becomes trivial to compute the offsets of every single lookup table. The offset of the main table is static. +---------------------------------+ | header | | Format version | | SHA1 | | SHA256 | | Blake3 | +---------------------------------+ | main table | | 11111111 -> bbbbbbbb -> 9999999 | | 22222222 -> aaaaaaaa -> 8888888 | +---------------------------------+ | table for SHA1 | | 11111 -> 1 | | 22222 -> 2 | +---------------------------------+ | table for SHA256 | | aaaaa -> 2 | | bbbbb -> 1 | +---------------------------------+ | table for Blake3 | | 88888 -> 2 | | 99999 -> 1 | +---------------------------------+ | trailer | | number of objects | | SHA1 abbrev | | SHA256 abbrev | | Blake3 abbrev | | hash | +---------------------------------+ Anyway, this is how I would have designed this format, and I think your format works differently. As I said, my intent here is not to say that you should take my format, but I mostly intend it as a discussion starter to figure out why you have chosen the proposed design so that I can get a better understanding for it. Thanks! Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 12/14] rust: add a new binary loose object map format
  2025-10-28  9:18   ` Patrick Steinhardt
@ 2025-10-29  1:37     ` brian m. carlson
  2025-10-29  9:07       ` Patrick Steinhardt
  0 siblings, 1 reply; 101+ messages in thread
From: brian m. carlson @ 2025-10-29  1:37 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Ezekiel Newren

[-- Attachment #1: Type: text/plain, Size: 11147 bytes --]

On 2025-10-28 at 09:18:32, Patrick Steinhardt wrote:
> Doesn't this indicate that calling this "loose object map" is kind of a
> misnomer? If we want to be able to store arbitrary objects regardless of
> the way those are stored (or not stored) in the ODB then I think it's
> overall quite confusing to have "loose" in the name.
>
> This isn't something we can fix for the old loose object map. But
> shouldn't we fix this now for the new format you're about to introduce?

Sure.  I will admit I'm terrible at naming things.  What do you think it
should be called?

> s/enabling//

Will fix in v2.

> As far as I understood, this legacy mapping wasn't really used anywhere
> as it is basically nonfunctional in the first place. Can we get away
> with dropping it altogether?

Sure, I can do that.

> Given that we're talking about multiple different hashes: which hash
> function is used for this checksum? I assume it's the main hash, but it
> might be sensible to document this.

It is the main hash.  I'll update that for v2.

> > +`git pack-objects` will repack existing entries into one file, removing any
> > +unnecessary objects, such as obsolete shallow entries or loose objects that
> > +have been packed.
>
> Curious that this is put into git-pack-objects(1), as it doesn't quite
> feel related to the task. Sure, it generates packfiles, but it doesn't
> really handle the logic to manage loose objects/packfiles in the repo.
> This feels closer to what git-repack(1) is doing, so would that be a
> better place to put it?

I've actually put this into `git gc`, which will come in a future
series, so I'll update this for v2.

> As far as I understand this allows us to even store multiple
> compatibility hashes if we were ever to grow a third hash. We would
> still be able to binary-search through the file as we can compute the
> size of every record with this header.

Exactly.  We were discussing BLAKE3 at the contributor summit as a
potential option.  The careful reader will note that this format looks
suspiciously like pack index v3, which is intentional.

> > + * 8-byte offset to the trailer from the beginning of this file.
> > + * Zero or more additional key/value pairs (4-byte key, 4-byte value), which
> > +   may optionally declare one or more chunks. No chunks are currently
> > +   defined. Readers must ignore unrecognized keys.
>
> How does the reader identify these key/value pairs and know how many of
> those there are? Also, do you already have an idea what those should be
> used for?

I'd imagined we could do something like fanout entries for tree
structures to help parse large trees better (since trees cannot be
binary searched).  That's something I wanted to add to multi-pack index
as a set of chunks.

They are read until the end of the header section.

> How does one figure out how many NUL bytes there's going to be? I guess
> the reader doesn't need to know as it simply uses the length of the
> header section to seek to the tables?

Exactly.  This is what we do with pack index v3 as well.  As a practical
matter, every chunk of NUL padding contains 0 to 3 bytes: just enough to
align the data for 4-byte access.
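For concreteness, the computation is just round-up-to-4 arithmetic; a
standalone sketch of what the patch's `required_nul_padding` computes:

    // Number of NUL bytes needed to pad a table of `nitems` entries of
    // `short_len` bytes each to the next 4-byte boundary.
    fn nul_padding(nitems: u64, short_len: u64) -> u64 {
        let misalignment = (nitems * short_len) & 3;
        (4 - misalignment) & 3 // 0 when already aligned, otherwise 1 to 3
    }

For example, 39 entries of 3-byte prefixes occupy 117 bytes, so 3 NUL
bytes of padding follow.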
> > +- Tables for the first object format: > > + * A sorted table of shortened object names. These are prefixes of the names > > + of all objects in this file, packed together without offset values to > > + reduce the cache footprint of the binary search for a specific object name. > > Okay. The length of the shortened object names is encoded in the header, > so all of the objects have the same length. > > Does the reader have a way to disambiguate the shortened object names? > They may be unambiguous at the point in time where the mapping is > written, but when they are being shortened it becomes plausible that the > object names becomes ambiguous at a later point in time. > > > + * A sorted table of full object names. > > Ah, I see! We have a second table further down that encodes full object > names, so yes, we can fully disambiguate. > > > + * A table of 4-byte metadata values. > > + * Zero or more chunks. A chunk starts with a four-byte chunk identifier and > > + a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte > > + size (not including the identifier, parameter, or size), plus the chunk > > + data. > > +- Zero or more NUL bytes. > > +- Tables for subsequent object formats: > > + * A sorted table of shortened object names. These are prefixes of the names > > + of all objects in this file, packed together without offset values to > > + reduce the cache footprint of the binary search for a specific object name. > > + * A table of full object names in the order specified by the first object format. > > Interesting, why are these sorted by the first object format again? > Doesn't that mean that I have to do a linear search now to locate the > entry for the second object format? No, it doesn't. The full object names are always in the order of the first format. The shortened names for second and subsequent formats point into an offset table that finds the offset in the first format. Therefore, to look up an OID in the second format knowing its OID in the first format, you use the first format's prefixes to find its offset, verify its OID in the full object names, and then look up that offset in the list of full object names in the second format. To go the other way, you find the prefix in the second format, find its corresponding offset in the mapping table, verify the full object ID in the second format, and then look up that offset in the full object names in the first format. > Disclaimer: the following paragraphs go into how I would have > designed this. This is _not_ meant as a "you have to do it this > way", but as a discussion starter to figure out why you have picked > the proposed format and for me to get a better understanding of it. The answer is that it very much resembles pack index v3, except that instead of having pack order, we just always use the sorted order of the first object format (since we don't have a pack). That also makes the data deterministic so that we always write identical files for identical objects. > Stepping back a bit, my expectation is that we'd have one lookup table > per object format so that we can map into all directions: SHA1 -> SHA256 > and in reverse. If we had more than two hash functions we'd also need to > have a table for e.g. Blake3 -> SHA1 and Blake3 -> SHA256 and reverse. Yeah, and then the file gets very large. We mmap these into memory and never free them during the life of the program (except when compacting them and deleting the unused ones), so we want to be quite conservative with our memory. 
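To make that lookup concrete, here is a minimal self-contained sketch
with invented names (the real implementation is `look_up_object` in the
patch); it maps a compatibility-format OID to its index in the
first-format tables:

    use std::cmp::Ordering;

    // Invented layout for illustration: the compat-format tables of one map file.
    struct CompatTables<'a> {
        short_len: usize,    // shortened-name length for the compat format
        shorts: &'a [u8],    // sorted shortened compat names, packed back to back
        mapping: &'a [u32],  // per shortened name: index in first-format order
        full: Vec<&'a [u8]>, // full compat names, in first-format order
    }

    // Binary search the packed prefixes, follow the order-mapping table,
    // then verify the full object ID before trusting the prefix match.
    fn compat_to_main_index(t: &CompatTables, oid: &[u8]) -> Option<usize> {
        let (mut lo, mut hi) = (0, t.full.len());
        while lo < hi {
            let mid = (lo + hi) / 2;
            let prefix = &t.shorts[mid * t.short_len..(mid + 1) * t.short_len];
            match prefix.cmp(&oid[..t.short_len]) {
                Ordering::Less => lo = mid + 1,
                Ordering::Greater => hi = mid,
                Ordering::Equal => {
                    let main_idx = t.mapping[mid] as usize;
                    return (t.full[main_idx] == oid).then_some(main_idx);
                }
            }
        }
        None
    }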
> One way to do this is to have three tables, one for each object format. > The object formats would be ordered lexicographically by their own > object ID, so that one can perform a binary search for an object ID in > every format. We have that with the shortened object IDs and we do a binary search over those. This is more cache-friendly and all we need to do is verify that the full object ID matches our value (as opposed to a different object stored elsewhere with an identical shortened prefix). > Each row could then either contain all compatibility hashes directly, > but this would explode quite fast in storage space. An optimization > would thus be to have one table per object format that contains the > shortened object ID plus an offset where the actual record can be found. > You know where to find the tables from the header, and you know the > exact size of each entry, so you can trivially perform a binary search > for the abbreviated object ID in that index. > > Once you've found that index you take the stored offset to look up the > record in the "main" table. This main table contains the full object IDs > for all object hashes. So something like the following simplified > format: > > +---------------------------------+ > | header | > | Format version | > | Number of object IDs | > | SHA1: abbrev, offset | > | SHA256: abbrev, offset | > | Blake3: abbrev, offset | > | Main: offset | > +---------------------------------+ > | table for SHA1 | > | 11111 -> 1 | > | 22222 -> 2 | > +---------------------------------+ > | table for SHA256 | > | aaaaa -> 2 | > | bbbbb -> 1 | > +---------------------------------+ > | table for Blake3 | > | 88888 -> 2 | > | 99999 -> 1 | > +---------------------------------+ > | main table | > | 11111111 -> bbbbbbbb -> 9999999 | > | 22222222 -> aaaaaaaa -> 8888888 | > +---------------------------------+ > | trailer | > | trailer hash | > +---------------------------------+ > > Overall you only have to store the full object ID for each hash exactly > once, and the mappings also only have to be stored once. But you can > look up an ID by each of its formats via its indices. This is very similar to what we have now, except that it has mapping offsets for each algorithm instead of the second and subsequent algorithms and it re-orders the location of the full object IDs. I also intentionally wanted to produce completely deterministic output, since in `git verify-pack` we verify that the output is byte-for-byte identical and I wanted to have the ability to do that here as well. (It isn't implemented yet, but that's a goal.) In order to do that, we need to write every part of the data in a fixed order, so we'd have to define the main table as being sorted by the first algorithm. > With some slight adjustments one could also adapt this format to become > streamable: I don't think these formats are as streamable as you might like. In order to create the tables, we need to sort the data for each algorithm to find the short name length, which requires knowing all of the data up front in order. I, too, thought that might be a nice idea, but when I implemented pack index v3, I realized that effectively all of the data has to be computed up front. Once you do that, computing the offsets isn't hard because it's just some addition and multiplication. I personally like a header with offsets better than a trailer since it makes parsing easier. We can peek at the first 64 bytes of the file to see if it meets our needs or has data we're interested in. -- brian m. 
carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 12/14] rust: add a new binary loose object map format 2025-10-29 1:37 ` brian m. carlson @ 2025-10-29 9:07 ` Patrick Steinhardt 0 siblings, 0 replies; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-29 9:07 UTC (permalink / raw) To: brian m. carlson, git, Junio C Hamano, Ezekiel Newren On Wed, Oct 29, 2025 at 01:37:49AM +0000, brian m. carlson wrote: > On 2025-10-28 at 09:18:32, Patrick Steinhardt wrote: > > Doesn't this indicate that calling this "loose object map" is kind of a > > misnomer? If we want to be able to store arbitrary objects regardless of > > the way those are stored (or not stored) in the ODB then I think it's > > overall quite confusing to have "loose" in the name. > > > > This isn't something we can fix for the old loose object map. But > > shouldn't we fix this now for the new format you're about to introduce? > > Sure. I will admit I'm terrible at naming things. What do you think it > should be called. I think the name is quite descriptive despite the misleading "loose" part. So can't we simply drop that part and call it "object map"? [snip] > > > + * A table of 4-byte metadata values. > > > + * Zero or more chunks. A chunk starts with a four-byte chunk identifier and > > > + a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte > > > + size (not including the identifier, parameter, or size), plus the chunk > > > + data. > > > +- Zero or more NUL bytes. > > > +- Tables for subsequent object formats: > > > + * A sorted table of shortened object names. These are prefixes of the names > > > + of all objects in this file, packed together without offset values to > > > + reduce the cache footprint of the binary search for a specific object name. > > > + * A table of full object names in the order specified by the first object format. > > > > Interesting, why are these sorted by the first object format again? > > Doesn't that mean that I have to do a linear search now to locate the > > entry for the second object format? > > No, it doesn't. The full object names are always in the order of the > first format. The shortened names for second and subsequent formats > point into an offset table that finds the offset in the first format. > > Therefore, to look up an OID in the second format knowing its OID in the > first format, you use the first format's prefixes to find its offset, > verify its OID in the full object names, and then look up that offset in > the list of full object names in the second format. > > To go the other way, you find the prefix in the second format, find its > corresponding offset in the mapping table, verify the full object ID in > the second format, and then look up that offset in the full object names > in the first format. Okay. [snip] > > Overall you only have to store the full object ID for each hash exactly > > once, and the mappings also only have to be stored once. But you can > > look up an ID by each of its formats via its indices. > > This is very similar to what we have now, except that it has mapping > offsets for each algorithm instead of the second and subsequent > algorithms and it re-orders the location of the full object IDs. > > I also intentionally wanted to produce completely deterministic output, > since in `git verify-pack` we verify that the output is byte-for-byte > identical and I wanted to have the ability to do that here as well. (It > isn't implemented yet, but that's a goal.) 
In order to do that, we need
> to write every part of the data in a fixed order, so we'd have to define
> the main table as being sorted by the first algorithm.

Okay.

> > With some slight adjustments one could also adapt this format to become
> > streamable:
>
> I don't think these formats are as streamable as you might like. In
> order to create the tables, we need to sort the data for each algorithm
> to find the short name length, which requires knowing all of the data up
> front in order.
>
> I, too, thought that might be a nice idea, but when I implemented pack
> index v3, I realized that effectively all of the data has to be computed
> up front. Once you do that, computing the offsets isn't hard because
> it's just some addition and multiplication.

I guess you can make it streamable if you don't care about deterministic
output and if you're willing to have a separate ordered lookup table for
the first hash. But in any case you'd have to keep all object IDs in
memory regardless of that so that those can be sorted. I'm not sure that
this really buys us much. So overall I'm fine with it not being
streamable.

> I personally like a header with offsets better than a trailer since it
> makes parsing easier. We can peek at the first 64 bytes of the file to
> see if it meets our needs or has data we're interested in.

It's not all that bad -- we for example use this for reftables. Both for
reftables and also for your format we'd mmap anyway, and in order to
mmap you need to figure out the overall size of the file first. From
there on it shouldn't be hard to figure out where the trailer starts
based on the number of hashes and their respective sizes announced in
the header.

But I remember that this led to some head scratching for myself when I
initially dived into the reftable library, so I very much acknowledge
that it at least adds _some_ complexity.

Anyway, thanks for these explanations! One suggestion: it helped me
quite a bit to draw the ASCII diagrams I had in my previous mail. How
about we add such a diagram to help readers a bit with the high-level
structure of the format?

Patrick

^ permalink raw reply	[flat|nested] 101+ messages in thread
* Re: [PATCH 12/14] rust: add a new binary loose object map format 2025-10-27 0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-29 17:03 ` Junio C Hamano 2025-10-29 18:21 ` Junio C Hamano 2 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 17:03 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > Our current loose object format has a few problems. First, it is not > efficient: the list of object IDs is not sorted and even if it were, > there would not be an efficient way to look up objects in both > algorithms. I was confused by reading the above, mostly because "our current loose object format" meant to me the "<type> SP <length-in-decimal> NUL <payload>" deflated with zlib, which has no list of object IDs. As Patrick commented you are talking about something else? Mapping mechanism for object names between primary and compat hash algorithms? > +== Loose object mapping > + > +When the `compatObjectFormat` option is used, Git needs to store a mapping > +between the repository's main algorithm and the compatibility algorithm. There > +are two formats for this: the legacy mapping and the modern mapping. > + > +=== Legacy mapping > + > +The compatibility mapping is stored in a file called > +`$GIT_DIR/objects/loose-object-idx`. The format of this file looks like this: > + > + # loose-object-idx > + (main-name SP compat-name LF)* > + > +`main-name` refers to hexadecimal object ID of the object in the main > +repository format and `compat-name` refers to the same thing, but for the > +compatibility format. > + > +This format is read if it exists but is not written. > + > +Note that carriage returns are not permitted in this file, regardless of the > +host system or configuration. Unless it is zero cost to keep supporting the reading side, perhaps we want to drop this mapping file format? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 12/14] rust: add a new binary loose object map format 2025-10-27 0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-29 17:03 ` Junio C Hamano @ 2025-10-29 18:21 ` Junio C Hamano 2 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 18:21 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > +=== Modern mapping > + > +The modern mapping consists of a set of files under `$GIT_DIR/objects/loose` > +ending in `.map`. The portion of the filename before the extension is that of > +the hash checksum in hex format. > + > +`git pack-objects` will repack existing entries into one file, removing any > +unnecessary objects, such as obsolete shallow entries or loose objects that > +have been packed. > + > +==== Mapping file format I know near the end of this document we talk about network-byte order, but let's say that upfront here. > +- A header appears at the beginning and consists of the following: > + * A 4-byte mapping signature: `LMAP` > + * 4-byte version number: 1 > + * 4-byte length of the header section. > + * 4-byte number of objects declared in this map file. > + * 4-byte number of object formats declared in this map file. > + * For each object format: > + ** 4-byte format identifier (e.g., `sha1` for SHA-1) > + ** 4-byte length in bytes of shortened object names. This is the > + shortest possible length needed to make names in the shortened > + object name table unambiguous. This number typically represents a small integer up to 32 or so, right? No objection to spend 4-byte for it, but initially I somehow was confused into thinking that this is the number of bytes for shortened object names of all the objects in this map file (i.e., (N * 6) if the map describes N objects, and 6-byte is sufficient prefix of the object names). I wonder if there is a way to rephrase the above to avoid such confusion? Also I assume that "shorten" refers to "take the first N-byte prefix". How about calling them "unique prefix of object names" or something? > + ** 8-byte integer, recording where tables relating to this format > + are stored in this index file, as an offset from the beginning. > + * 8-byte offset to the trailer from the beginning of this file. OK. > + * Zero or more additional key/value pairs (4-byte key, 4-byte value), which > + may optionally declare one or more chunks. No chunks are currently > + defined. Readers must ignore unrecognized keys. Is this misindented? In other words, shouldn't the "padding" sit immediately after "offset of the trailer in the file" and at the same level? This uses the word "chunk", which risks implying some relationship with what is described in Documentation/gitformat-chunk.adoc, but I suspect this file format has nothing to do with "Chunk-based file format" described there. "4-byte key plus 4-byte value" gives an impression that it is a dictionary to associate bunch of 4-byte words with 4-byte values, and it is hard to guess where the word "chunk" comes from. 4-byte keyword plus 4-byte offset into (a later part of) the file where the chunk defined by that keyword is stored? The length of the header part minus the size up to the 8-byte offset to the trailer defines the size occupied by "additional key/value pairs", so the reader is supposed to tell if the next 4-byte is a key that it cannot recognise or beyond the end of the header part? 
How about replacing this with

    * The remainder of the header section is reserved for future use.
      Readers must ignore this section.

until we know what kind of "chunks" are needed?

> +- Zero or more NUL bytes. These are used to improve the alignment of the
> +  4-byte quantities below.

Everything we saw so far, if the tail end of the header section that is
reserved for future use would hold zero or more <4-byte key, 4-byte
value> pairs, is of a size divisible by 4. If anything, we may be better
off saying

 * all the sections described below are placed contiguously without gap
   in the file

 * all the sections are padded with zero or more NUL bytes to make their
   length a multiple of 4

upfront, even before we start talking about the "header" section. Then
the "Zero or more NUL bytes" here, and the padding between tables, do
not have to be explicitly described.

> +- Tables for the first object format:
> +  * A sorted table of shortened object names. These are prefixes of the names
> +    of all objects in this file, packed together without offset values to
> +    reduce the cache footprint of the binary search for a specific object name.

"packed together without offset values...", while understandable, smells
a bit out of place, especially since you haven't explained what you are
trying to let readers find out from this table when they have one object
name.

Presumably, you have them take the first "length in bytes of shortened
object names" bytes from the object name they have, binary search in
this unique-prefix table for an entry that matches the prefix, to find
out that their object may appear as the N-th object in the table (but
the document hasn't told the readers that is how this table is designed
to be used yet)? And using that offset, the reader would probably ensure
that the N-th entry that appears in the next "full object names" table
does indeed fully match the object they have?

If that is the case, it is obvious that there is no "offset value"
needed here, but when the reader does not even know how this table is
supposed to be used, a sudden mention of "offset values" only confuses
them.

> +  * A sorted table of full object names.

I assume that the above two "*" bullet points are supposed to be aligned
(iow, sit at the same level under "Tables for the first object format").
In any case, our reader with a single object name would have found out
that their object appears as the N-th entry of these two tables.

> +  * A table of 4-byte metadata values.

Again, is this (and the next) "*" bullet point at the same level as the
above two tables?

The number of entries in this table is not specified. Is it one 4-byte
metadata per object described in the table (i.e. our reader recalls that
the header has a 4-byte number of objects declared in this file)? IOW,
would our reader, after finding out that the object they have is found
as the N-th entry in the previous "full object names" table, look at the
N-th entry of this metadata value table to find the metadata for their
object?

> +  * Zero or more chunks. A chunk starts with a four-byte chunk identifier and
> +    a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte
> +    size (not including the identifier, parameter, or size), plus the chunk
> +    data.

When the chunk data is not a multiple of 4 bytes, don't we pad? If we
do, would the padding be included in the 8-byte size? Or if the first
chunk is of an odd size, would the second chunk be unaligned from its
identifier, parameter and size fields?
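To illustrate, a reader-side sketch of walking these chunks (this
assumes, as the next paragraph discusses, that the 8-byte size also
covers any trailing padding):

    use std::convert::TryInto;

    // Walk a sequence of chunks: 4-byte identifier, 4-byte parameter,
    // 8-byte big-endian size, then `size` bytes of chunk data. A reader
    // skips chunks whose identifier it does not recognize.
    fn walk_chunks(mut data: &[u8]) {
        while data.len() >= 16 {
            let _id = &data[0..4];
            let _param = &data[4..8];
            let size = u64::from_be_bytes(data[8..16].try_into().unwrap()) as usize;
            if data.len() < 16 + size {
                break; // truncated chunk; a real parser would error out
            }
            // A reader that recognizes `_id` would parse data[16..16 + size] here.
            data = &data[16 + size..];
        }
    }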
Presumably, you will allow older readers to safely skip chunks of newer
type they do not recognise, so a reader is expected to grab the first 16
bytes for (id, param, size), and if it does not care about the id, just
skip the size bytes to reach the next chunk, so if we were to pad (which
I think would be reasonable, given that you are padding sections to
4-byte boundaries), the eight-byte size would also count the padding at
the end of the chunk data (if the chunk data needs padding at the end,
that is).

If we make it clear that these chunks are aligned at 4-byte (or 8-byte,
I dunno) boundaries, then ...

> +- Zero or more NUL bytes.

... we do not need to have this entry whose length is unspecified (I can
guess that you added it to allow the reader to skip to the next 4-byte
boundary, but this document does not really specify it).

> +- Tables for subsequent object formats:
> +  * A sorted table of shortened object names. These are prefixes of the names
> +    of all objects in this file, packed together without offset values to
> +    reduce the cache footprint of the binary search for a specific object name.
> +  * A table of full object names in the order specified by the first object format.
> +  * A table of 4-byte values mapping object name order to the order of the
> +    first object format. For an object in the table of sorted shortened object
> +    names, the value at the corresponding index in this table is the index in
> +    the previous table for that same object.
> +  * Zero or more NUL bytes.

The same comment as the section for the primary object format. I assume
that the above four "*" bullet points are at the same level, i.e. one
unique-prefix table to let a reader with a single object name find that
their object may be the one at the N-th location in the table, followed
by the full object name table to verify that the N-th object indeed is
their object, and then find from that N that the corresponding object
name in the other hash is the M-th object in the table in the first
object format, and they go from this M to the 4-byte metadata for that
object?

> +- The trailer consists of the following:
> +  * Hash checksum of all of the above.
> +
> +The lower six bits of each metadata table contain a type field indicating the
> +reason that this object is stored:
> +
> +0::
> +	Reserved.
> +1::
> +	This object is stored as a loose object in the repository.
> +2::
> +	This object is a shallow entry. The mapping refers to a shallow value
> +	returned by a remote server.
> +3::
> +	This object is a submodule entry. The mapping refers to the commit stored
> +	representing a submodule.
> +
> +Other data may be stored in this field in the future. Bits that are not used
> +must be zero.
> +
> +All 4-byte numbers are in network order and must be 4-byte aligned in the file,
> +so the NUL padding may be required in some cases.

The document needs to be clear whether the "length" field for each
section counts this padding.

> +impl LooseObjectMemoryMap {
> +    /// Create a new `LooseObjectMemoryMap`.
> +    ///
> +    /// The storage and compatibility `HashAlgorithm` instances are used to store the object IDs in
> +    /// the correct map.
> +    fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> LooseObjectMemoryMap {
> +        LooseObjectMemoryMap {
> +            to_compat: BTreeMap::new(),
> +            to_storage: BTreeMap::new(),
> +            compat,
> +            storage,
> +        }
> +    }
> +
> +    fn len(&self) -> usize {
> +        self.to_compat.len()
> +    }
> +
> +    /// Write this map to an interface implementing `std::io::Write`.
> + fn write<W: Write>(&self, wrtr: W) -> io::Result<()> { > + const VERSION_NUMBER: u32 = 1; > + const NUM_OBJECT_FORMATS: u32 = 2; > + const PADDING: [u8; 4] = [0u8; 4]; > + > + let mut wrtr = wrtr; > + let header_size: u32 = 4 + 4 + 4 + 4 + 4 + (4 + 4 + 8) * 2 + 8; Yikes. Can this be written in a way that is easier to maintain? Certainly the earlier run of 4's corresponds to what the code below writes to wrtr, and I am wondering if we can ask wrtr how many bytes we have asked it to write so far, or something, without having the above hard-to-read numbers. > + wrtr.write_all(b"LMAP")?; > + wrtr.write_all(&VERSION_NUMBER.to_be_bytes())?; > + wrtr.write_all(&header_size.to_be_bytes())?; > + wrtr.write_all(&(self.to_compat.len() as u32).to_be_bytes())?; > + wrtr.write_all(&NUM_OBJECT_FORMATS.to_be_bytes())?; > + > + let storage_short_len = self.find_short_name_len(&self.to_compat, self.storage); > + let compat_short_len = self.find_short_name_len(&self.to_storage, self.compat); > + > + let storage_npadding = Self::required_nul_padding(self.to_compat.len(), storage_short_len); > + let compat_npadding = Self::required_nul_padding(self.to_compat.len(), compat_short_len); I said 100-column limit is OK, but I am already hating myself saying so. ^ permalink raw reply [flat|nested] 101+ messages in thread
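To make the lookup flow reconstructed in the review above concrete,
here is a minimal sketch in Rust (not code from the series; the names
and the fixed per-entry widths are assumptions) of binary searching
the shortened-name table and then confirming the hit against the
full-name table:

    fn lookup(prefixes: &[u8], full: &[u8], short_len: usize,
              full_len: usize, oid: &[u8]) -> Option<usize> {
        let key = &oid[..short_len];
        let (mut lo, mut hi) = (0usize, prefixes.len() / short_len);
        // Manual bisection: slice::binary_search cannot be used because
        // the entry width is only known at run time.
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let entry = &prefixes[mid * short_len..(mid + 1) * short_len];
            if entry < key {
                lo = mid + 1;
            } else if entry > key {
                hi = mid;
            } else {
                // Prefix hit at index N == mid; verify the N-th full name.
                let cand = &full[mid * full_len..(mid + 1) * full_len];
                return if cand == oid { Some(mid) } else { None };
            }
        }
        None
    }

As the review observes, no offset values are needed: the index found
in the prefix table is reused directly in the full-name and metadata
tables.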
* [PATCH 13/14] rust: add a small wrapper around the hashfile code
  2025-10-27  0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson
                   ` (11 preceding siblings ...)
  2025-10-27  0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson
@ 2025-10-27  0:44 ` brian m. carlson
  2025-10-28 18:19   ` Ezekiel Newren
  2025-10-27  0:44 ` [PATCH 14/14] object-file-convert: always make sure object ID algo is valid brian m. carlson
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 101+ messages in thread
From: brian m. carlson @ 2025-10-27 0:44 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

Our new binary loose object map code avoids needing to be intimately
involved with file handling by simply writing data to an object
implementing Write. This makes it very easy to test by writing to a
Cursor wrapping a Vec, and thus decouples it from intimate knowledge
about how we handle files.

However, in actual use we want to write our data to a real file, since
that's the most practical way to persist data. Implement a wrapper
around the hashfile code that implements the Write trait so that we can
write our loose object map into a file.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile         |  1 +
 src/csum_file.rs | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
 src/lib.rs       |  1 +
 src/meson.build  |  1 +
 4 files changed, 84 insertions(+)
 create mode 100644 src/csum_file.rs

diff --git a/Makefile b/Makefile
index 2081b13780..8eb31aeed2 100644
--- a/Makefile
+++ b/Makefile
@@ -1521,6 +1521,7 @@ CLAR_TEST_OBJS += $(UNIT_TEST_DIR)/unit-test.o
 
 UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o
 
+RUST_SOURCES += src/csum_file.rs
 RUST_SOURCES += src/hash.rs
 RUST_SOURCES += src/lib.rs
 RUST_SOURCES += src/loose.rs
diff --git a/src/csum_file.rs b/src/csum_file.rs
new file mode 100644
index 0000000000..7f2c6c4fcb
--- /dev/null
+++ b/src/csum_file.rs
@@ -0,0 +1,81 @@
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation: version 2 of the License, dated June 1991.
+//
+// This program is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License along
+// with this program; if not, see <https://www.gnu.org/licenses/>.
+
+use crate::hash::{HashAlgorithm, GIT_MAX_RAWSZ};
+use std::ffi::CStr;
+use std::io::{self, Write};
+use std::os::raw::c_void;
+
+/// A writer that can write files identified by their hash or containing a trailing hash.
+pub struct HashFile {
+    ptr: *mut c_void,
+    algo: HashAlgorithm,
+}
+
+impl HashFile {
+    /// Create a new HashFile.
+    ///
+    /// The hash used will be `algo`, its name should be in `name`, and an open file descriptor
+    /// pointing to that file should be in `fd`.
+    pub fn new(algo: HashAlgorithm, fd: i32, name: &CStr) -> HashFile {
+        HashFile {
+            ptr: unsafe { c::hashfd(algo.hash_algo_ptr(), fd, name.as_ptr()) },
+            algo,
+        }
+    }
+
+    /// Finalize this HashFile instance.
+    ///
+    /// Returns the hash computed over the data.
+ pub fn finalize(self, component: u32, flags: u32) -> Vec<u8> { + let mut result = vec![0u8; GIT_MAX_RAWSZ]; + unsafe { c::finalize_hashfile(self.ptr, result.as_mut_ptr(), component, flags) }; + result.truncate(self.algo.raw_len()); + result + } +} + +impl Write for HashFile { + fn write(&mut self, data: &[u8]) -> io::Result<usize> { + for chunk in data.chunks(u32::MAX as usize) { + unsafe { + c::hashwrite( + self.ptr, + chunk.as_ptr() as *const c_void, + chunk.len() as u32, + ) + }; + } + Ok(data.len()) + } + + fn flush(&mut self) -> io::Result<()> { + unsafe { c::hashflush(self.ptr) }; + Ok(()) + } +} + +pub mod c { + use std::os::raw::{c_char, c_int, c_void}; + + extern "C" { + pub fn hashfd(algop: *const c_void, fd: i32, name: *const c_char) -> *mut c_void; + pub fn hashwrite(f: *mut c_void, data: *const c_void, len: u32); + pub fn hashflush(f: *mut c_void); + pub fn finalize_hashfile( + f: *mut c_void, + data: *mut u8, + component: u32, + flags: u32, + ) -> c_int; + } +} diff --git a/src/lib.rs b/src/lib.rs index 442f9433dc..0c598298b1 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1,3 +1,4 @@ +pub mod csum_file; pub mod hash; pub mod loose; pub mod varint; diff --git a/src/meson.build b/src/meson.build index 1eea068519..45739957b4 100644 --- a/src/meson.build +++ b/src/meson.build @@ -1,4 +1,5 @@ libgit_rs_sources = [ + 'csum_file.rs', 'hash.rs', 'lib.rs', 'loose.rs', ^ permalink raw reply related [flat|nested] 101+ messages in thread
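The decoupling described in the commit message is easy to picture with
a sketch (not code from the series; serialize_map is a hypothetical
stand-in for the real serializer): the same code can target an
in-memory buffer in unit tests and a HashFile in production, because
both implement Write:

    use std::io::{self, Cursor, Write};

    // The serializer only needs some W: Write; it never opens files itself.
    fn serialize_map<W: Write>(mut out: W) -> io::Result<()> {
        out.write_all(b"LMAP")?; // ...rest of the format elided...
        Ok(())
    }

    #[test]
    fn round_trip() -> io::Result<()> {
        // Tests capture the bytes in memory and can inspect them freely.
        let mut buf = Cursor::new(Vec::new());
        serialize_map(&mut buf)?;
        assert_eq!(&buf.get_ref()[..4], b"LMAP");
        Ok(())
    }

In production the same serialize_map would be handed a HashFile
instead, and the trailing checksum comes out of finalize().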
* Re: [PATCH 13/14] rust: add a small wrapper around the hashfile code 2025-10-27 0:44 ` [PATCH 13/14] rust: add a small wrapper around the hashfile code brian m. carlson @ 2025-10-28 18:19 ` Ezekiel Newren 2025-10-29 1:39 ` brian m. carlson 0 siblings, 1 reply; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 18:19 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Patrick Steinhardt On Sun, Oct 26, 2025 at 6:44 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > +use crate::hash::{HashAlgorithm, GIT_MAX_RAWSZ}; > +use std::ffi::CStr; > +use std::io::{self, Write}; > +use std::os::raw::c_void; std::os::raw has been deprecated, only std::ffi should be used. > +/// A writer that can write files identified by their hash or containing a trailing hash. > +pub struct HashFile { > + ptr: *mut c_void, > + algo: HashAlgorithm, > +} > + > +impl HashFile { > + /// Create a new HashFile. > + /// > + /// The hash used will be `algo`, its name should be in `name`, and an open file descriptor > + /// pointing to that file should be in `fd`. > + pub fn new(algo: HashAlgorithm, fd: i32, name: &CStr) -> HashFile { > + HashFile { > + ptr: unsafe { c::hashfd(algo.hash_algo_ptr(), fd, name.as_ptr()) }, > + algo, > + } > + } - pub fn new(algo: HashAlgorithm, fd: i32, name: &CStr) -> HashFile { - HashFile { + pub fn new(algo: HashAlgorithm, fd: i32, name: &CStr) -> Self { + Self { > + /// Finalize this HashFile instance. > + /// > + /// Returns the hash computed over the data. > + pub fn finalize(self, component: u32, flags: u32) -> Vec<u8> { > + let mut result = vec![0u8; GIT_MAX_RAWSZ]; > + unsafe { c::finalize_hashfile(self.ptr, result.as_mut_ptr(), component, flags) }; > + result.truncate(self.algo.raw_len()); > + result > + } > +} > + > +impl Write for HashFile { > + fn write(&mut self, data: &[u8]) -> io::Result<usize> { > + for chunk in data.chunks(u32::MAX as usize) { > + unsafe { > + c::hashwrite( > + self.ptr, > + chunk.as_ptr() as *const c_void, > + chunk.len() as u32, > + ) > + }; > + } > + Ok(data.len()) > + } > + > + fn flush(&mut self) -> io::Result<()> { > + unsafe { c::hashflush(self.ptr) }; > + Ok(()) > + } > +} It's always nice to implement the _Write_ trait for any type that consumes &[u8] slices. It makes it easy to use a plethora of standard library functions. ^ permalink raw reply [flat|nested] 101+ messages in thread
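To illustrate the closing point of the review above (a sketch, not
code from the series; HashFile is the type from the patch): once the
Write impl exists, the standard helpers compose with it for free:

    use std::io::{self, Write};

    fn demo(mut hf: HashFile, data: &[u8]) -> io::Result<()> {
        hf.write_all(data)?;                // loops over short writes for us
        io::copy(&mut &data[..], &mut hf)?; // bulk copy from any Read source
        hf.flush()
    }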
* Re: [PATCH 13/14] rust: add a small wrapper around the hashfile code 2025-10-28 18:19 ` Ezekiel Newren @ 2025-10-29 1:39 ` brian m. carlson 0 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-29 1:39 UTC (permalink / raw) To: Ezekiel Newren; +Cc: git, Junio C Hamano, Patrick Steinhardt [-- Attachment #1: Type: text/plain, Size: 617 bytes --] On 2025-10-28 at 18:19:27, Ezekiel Newren wrote: > On Sun, Oct 26, 2025 at 6:44 PM brian m. carlson > <sandals@crustytoothpaste.net> wrote: > > > +use crate::hash::{HashAlgorithm, GIT_MAX_RAWSZ}; > > +use std::ffi::CStr; > > +use std::io::{self, Write}; > > +use std::os::raw::c_void; > > std::os::raw has been deprecated, only std::ffi should be used. std::ffi with the C types is not available until Rust 1.64 and we're not planning on targeting that for some time. This was intentional, but I'll mention it in the commit message for v2. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
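For readers following along, the version constraint looks roughly like
this (a sketch; the stabilization versions are from the Rust release
notes, not from this thread):

    // Fine on old toolchains: c_void has been in std::ffi since Rust 1.30.
    use std::ffi::c_void;
    // Only valid from Rust 1.64 on, when the C integer aliases were added
    // to std::ffi; with an older MSRV they must come from std::os::raw.
    use std::os::raw::{c_char, c_int};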
* [PATCH 14/14] object-file-convert: always make sure object ID algo is valid
  2025-10-27  0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson
                   ` (12 preceding siblings ...)
  2025-10-27  0:44 ` [PATCH 13/14] rust: add a small wrapper around the hashfile code brian m. carlson
@ 2025-10-27  0:44 ` brian m. carlson
  2025-10-29 20:07 ` [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 Junio C Hamano
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 101+ messages in thread
From: brian m. carlson @ 2025-10-27 0:44 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

In some cases, we zero-initialize our object IDs, which sets the algo
member to zero as well, which is not a valid algorithm number. This is
a bad practice, but we typically paper over it by simply substituting
the repository's hash algorithm.

However, our new Rust loose object map code doesn't handle this
gracefully and can't find object IDs when the algorithm is zero
because they don't compare equal to those with the correct algo field.
In addition, the comparison code doesn't have any knowledge of what
the main algorithm is because that's global state, so we can't adjust
the comparison.

To make our code function properly and to avoid propagating these bad
entries, if we get a source object ID with a zero algo, just make a
copy of it with the corrected algorithm. This also has the benefit of
fixing the object IDs when we're in single algorithm mode.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 object-file-convert.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/object-file-convert.c b/object-file-convert.c
index e44c821084..f8dce94811 100644
--- a/object-file-convert.c
+++ b/object-file-convert.c
@@ -13,7 +13,7 @@
 #include "gpg-interface.h"
 #include "object-file-convert.h"
 
-int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
+int repo_oid_to_algop(struct repository *repo, const struct object_id *srcoid,
                       const struct git_hash_algo *to, struct object_id *dest)
 {
         /*
@@ -21,7 +21,15 @@ int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
          * default hash algorithm for that object.
          */
         const struct git_hash_algo *from =
-                src->algo ? &hash_algos[src->algo] : repo->hash_algo;
+                srcoid->algo ? &hash_algos[srcoid->algo] : repo->hash_algo;
+        struct object_id temp;
+        const struct object_id *src = srcoid;
+
+        if (!srcoid->algo) {
+                oidcpy(&temp, srcoid);
+                temp.algo = hash_algo_by_ptr(repo->hash_algo);
+                src = &temp;
+        }
 
         if (from == to || !to) {
                 if (src != dest)

^ permalink raw reply related [flat|nested] 101+ messages in thread
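To see why the Rust side cannot paper over a zero algo, consider this
sketch (ObjectID is the Rust type from earlier in the series; Clone,
Ord, and a public algo field are assumed here for illustration only):

    use std::collections::BTreeMap;

    fn demo(map: &BTreeMap<ObjectID, ObjectID>, oid: &ObjectID) {
        let mut zeroed = oid.clone();
        zeroed.algo = 0; // what a zero-initialized struct object_id carries
        // The algorithm tag participates in equality and ordering, so the
        // two keys differ even though the digest bytes are identical, and
        // the lookup comes back empty.
        assert!(map.get(&zeroed).is_none());
    }

Fixing up the algo on the C side before the lookup, as this patch does,
sidesteps the problem without teaching the Rust code about global state.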
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (13 preceding siblings ...) 2025-10-27 0:44 ` [PATCH 14/14] object-file-convert: always make sure object ID algo is valid brian m. carlson @ 2025-10-29 20:07 ` Junio C Hamano 2025-10-29 20:15 ` Junio C Hamano 2025-11-11 0:12 ` Ezekiel Newren ` (2 subsequent siblings) 17 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 20:07 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > This is the second part of the SHA-1/SHA-256 interoperability work. It > introduces our first major use of Rust code to implement a loose object > format as well as preparatory work to make that happen, including > changing types to more Rust-friendly ones. Since Rust will be required > for the interoperability work, we require that in the testsuite. So, "make WITH_RUST=YesPlease" on 'seen' seems to barf like so (line wrapping added by me): ... AR libgit.a CARGO target/release/libgitcore.a error: the `cargo::` syntax for build script output instructions was added \ in Rust 1.77.0, but the minimum supported Rust version of `gitcore \ v0.1.0 (/home/jch/w/git.build)` is 1.49.0. Switch to the old `cargo:rustc-link-search=.` syntax (note the single colon). See https://doc.rust-lang.org/cargo/reference/build-scripts.html#outputs-of\ -the-build-script for more information about build script outputs. gmake: *** [Makefile:2964: target/release/libgitcore.a] Error 101 We either need to downdate the syntax or do the following, but IIRC, 1.77 is a bit too new for the Debian oldstable? Cargo.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git c/Cargo.toml w/Cargo.toml index 2f51bf5d5f..7772321dd7 100644 --- c/Cargo.toml +++ w/Cargo.toml @@ -2,7 +2,7 @@ name = "gitcore" version = "0.1.0" edition = "2018" -rust-version = "1.49.0" +rust-version = "1.77.0" [lib] crate-type = ["staticlib"] ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-10-29 20:07 ` [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 Junio C Hamano
@ 2025-10-29 20:15   ` Junio C Hamano
  0 siblings, 0 replies; 101+ messages in thread
From: Junio C Hamano @ 2025-10-29 20:15 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren

Junio C Hamano <gitster@pobox.com> writes:

> We either need to downdate the syntax or do the following, but IIRC,
> 1.77 is a bit too new for the Debian oldstable?
>
>  Cargo.toml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git c/Cargo.toml w/Cargo.toml
> index 2f51bf5d5f..7772321dd7 100644
> --- c/Cargo.toml
> +++ w/Cargo.toml
> @@ -2,7 +2,7 @@
>  name = "gitcore"
>  version = "0.1.0"
>  edition = "2018"
> -rust-version = "1.49.0"
> +rust-version = "1.77.0"
>
>  [lib]
>  crate-type = ["staticlib"]

For now, I'd add this on top of the topic and rebuild 'seen'.

--- >8 ---
Subject: [PATCH] SQUASH??? downgrade build.rs syntax

As the build with "make WITH_RUST=YesPlease" dies like so

    ...
    AR libgit.a
    CARGO target/release/libgitcore.a
    error: the `cargo::` syntax for build script output instructions was added in \
      Rust 1.77.0, but the minimum supported Rust version of `gitcore v0.1.0 \
      (/home/gitster/w/git.build)` is 1.49.0.
    Switch to the old `cargo:rustc-link-search=.` syntax (note the single colon).
    See https://doc.rust-lang.org/cargo/reference/build-scripts.html#outputs-of-\
    the-build-script for more information about build script outputs.
    gmake: *** [Makefile:2964: target/release/libgitcore.a] Error 101

work it around by downgrading the syntax as the error message
suggests.
---
 build.rs | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/build.rs b/build.rs
index 136d58c35a..3228367b5d 100644
--- a/build.rs
+++ b/build.rs
@@ -11,11 +11,11 @@
 // with this program; if not, see <https://www.gnu.org/licenses/>.
 
 fn main() {
-    println!("cargo::rustc-link-search=.");
-    println!("cargo::rustc-link-search=reftable");
-    println!("cargo::rustc-link-search=xdiff");
-    println!("cargo::rustc-link-lib=git");
-    println!("cargo::rustc-link-lib=reftable");
-    println!("cargo::rustc-link-lib=z");
-    println!("cargo::rustc-link-lib=xdiff");
+    println!("cargo:rustc-link-search=.");
+    println!("cargo:rustc-link-search=reftable");
+    println!("cargo:rustc-link-search=xdiff");
+    println!("cargo:rustc-link-lib=git");
+    println!("cargo:rustc-link-lib=reftable");
+    println!("cargo:rustc-link-lib=z");
+    println!("cargo:rustc-link-lib=xdiff");
 }
-- 
2.51.2-698-g3eff15350e

^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (14 preceding siblings ...) 2025-10-29 20:07 ` [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 Junio C Hamano @ 2025-11-11 0:12 ` Ezekiel Newren 2025-11-14 17:25 ` Junio C Hamano 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson 17 siblings, 0 replies; 101+ messages in thread From: Ezekiel Newren @ 2025-11-11 0:12 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Patrick Steinhardt On Sun, Oct 26, 2025 at 6:44 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > > This is the second part of the SHA-1/SHA-256 interoperability work. It > introduces our first major use of Rust code to implement a loose object > format as well as preparatory work to make that happen, including > changing types to more Rust-friendly ones. Since Rust will be required > ... I'm working on a patch series that converts the Cargo crate into a Cargo workspace. This means that /src will be moved to /gitcore/src. I plan on releasing that patch series after v2.52.0 is released. Using a Cargo workspace over a single crate is discussed partially in [1]. Patrick has decided to let me introduce cbindgen and the Cargo workspace conversion [2]. [1] Patrick's patch series on cbindgen https://lore.kernel.org/git/20251023-b4-pks-rust-cbindgen-v1-0-c19b61b03127@pks.im/ [2] Patrick discarding his patch series https://lore.kernel.org/git/aQ3XOTX0AT_eFc5P@pks.im/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (15 preceding siblings ...) 2025-11-11 0:12 ` Ezekiel Newren @ 2025-11-14 17:25 ` Junio C Hamano 2025-11-14 21:11 ` Junio C Hamano 2025-11-17 6:56 ` Junio C Hamano 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson 17 siblings, 2 replies; 101+ messages in thread From: Junio C Hamano @ 2025-11-14 17:25 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > The new Rust files have adopted an approach that is slightly different > from some of our other files and placed a license notice at the top. > This is required because of DCO part (a): "I have the right to submit it > under the open source license indicated in the file". It also avoids > ambiguity if the file is copied into a separate location (such as an LLM > training corpus). You may be aware of them already, but just in case, I was looking at CI breakages and noticed that "cargo clippy" warnings added in 4b44c464 (ci: check for common Rust mistakes via Clippy, 2025-10-15) https://github.com/git/git/actions/runs/19346329259/job/55347554528#step:5:73 mostly seem to come from steps 12 and 13 of this series. Thanks. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-14 17:25 ` Junio C Hamano
@ 2025-11-14 21:11   ` Junio C Hamano
  0 siblings, 0 replies; 101+ messages in thread
From: Junio C Hamano @ 2025-11-14 21:11 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren

Junio C Hamano <gitster@pobox.com> writes:

> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
>> The new Rust files have adopted an approach that is slightly different
>> from some of our other files and placed a license notice at the top.
>> This is required because of DCO part (a): "I have the right to submit it
>> under the open source license indicated in the file". It also avoids
>> ambiguity if the file is copied into a separate location (such as an LLM
>> training corpus).
>
> You may be aware of them already, but just in case, I was looking at
> CI breakages and noticed that "cargo clippy" warnings added in
> 4b44c464 (ci: check for common Rust mistakes via Clippy, 2025-10-15)
>
> https://github.com/git/git/actions/runs/19346329259/job/55347554528#step:5:73
>
> mostly seem to come from steps 12 and 13 of this series.
>
> Thanks.

This is what I queued on top for today's integration run in an attempt
to work it around.

I am happy about the changes to assert_eq!(*, [true|false]), even
though I may not be happy that clippy is unhappy about this particular
construct. I also am not so unhappy with the "do not needlessly
borrow" changes near the end.

The first hunk in src/loose.rs is a monkey-see-monkey-do patch that
may or may not make any sense; I strongly want it replaced with a
proper update by somebody who knows what they are doing.

 src/hash.rs  |  2 +-
 src/loose.rs | 15 ++++++++-------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/src/hash.rs b/src/hash.rs
index 8798a50aef..cc696688af 100644
--- a/src/hash.rs
+++ b/src/hash.rs
@@ -310,7 +310,7 @@ mod tests {
         ];
         for (data, oid) in tests {
             let mut h = algo.hasher();
-            assert_eq!(h.is_safe(), true);
+            assert!(h.is_safe());
             // Test that this works incrementally.
             h.update(&data[0..2]);
             h.update(&data[2..]);
diff --git a/src/loose.rs b/src/loose.rs
index a4e7d2fa48..8d4264c626 100644
--- a/src/loose.rs
+++ b/src/loose.rs
@@ -700,7 +700,8 @@ mod tests {
         }
     }
 
-    fn test_entries() -> &'static [(&'static str, &'static [u8], &'static [u8], MapType, bool)] {
+    type TwoHashesTestVectorEntry = (&'static str, &'static [u8], &'static [u8], MapType, bool);
+    fn test_entries() -> &'static [TwoHashesTestVectorEntry] {
         // These are all example blobs containing the content in the first argument.
&[ ("abc", b"\xf2\xba\x8f\x84\xab\x5c\x1b\xce\x84\xa7\xb4\x41\xcb\x19\x59\xcf\xc7\x09\x3b\x7f", b"\xc1\xcf\x6e\x46\x50\x77\x93\x0e\x88\xdc\x51\x36\x64\x1d\x40\x2f\x72\xa2\x29\xdd\xd9\x96\xf6\x27\xd6\x0e\x96\x39\xea\xba\x35\xa6", MapType::LooseObject, false), @@ -741,7 +742,7 @@ mod tests { let mut wrtr = TrailingWriter::new(); map.finish_batch(&mut wrtr).unwrap(); - assert_eq!(map.has_batch(), false); + assert!(!map.has_batch()); let data = wrtr.finalize(); MmapedLooseObjectMap::new(&data, HashAlgorithm::SHA256).unwrap(); @@ -754,7 +755,7 @@ mod tests { let mut wrtr = TrailingWriter::new(); map.finish_batch(&mut wrtr).unwrap(); - assert_eq!(map.has_batch(), false); + assert!(!map.has_batch()); let data = wrtr.finalize(); let entries = test_entries(); @@ -886,16 +887,16 @@ mod tests { let s256 = f(HashAlgorithm::SHA256); let s1 = f(HashAlgorithm::SHA1); - let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap(); + let res = map.map_object(s256, HashAlgorithm::SHA1).unwrap(); assert_eq!(res.oid, *s1); assert_eq!(res.kind, MapType::Reserved); - let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap(); + let res = map.map_oid(s256, HashAlgorithm::SHA1).unwrap(); assert_eq!(*res, *s1); - let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap(); + let res = map.map_object(s1, HashAlgorithm::SHA256).unwrap(); assert_eq!(res.oid, *s256); assert_eq!(res.kind, MapType::Reserved); - let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap(); + let res = map.map_oid(s1, HashAlgorithm::SHA256).unwrap(); assert_eq!(*res, *s256); } } -- 2.52.0-rc2-455-g230fcf2819 ^ permalink raw reply related [flat|nested] 101+ messages in thread
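For anyone puzzling over the two kinds of changes above, this is
roughly what clippy is reacting to (a sketch, not code from the
series; the type names come from the quoted diffs):

    // clippy::needless_borrow: `f` already hands back a reference, so
    // `&s256` would be a `&&ObjectID` the compiler has to auto-deref.
    fn demo(map: &MmapedLooseObjectMap,
            f: impl Fn(HashAlgorithm) -> &'static ObjectID) {
        let s256 = f(HashAlgorithm::SHA256);
        let _ = map.map_object(s256, HashAlgorithm::SHA1);
    }

    // clippy::type_complexity: name the five-element tuple once instead
    // of spelling it out inline in a function signature.
    type TwoHashesTestVectorEntry =
        (&'static str, &'static [u8], &'static [u8], MapType, bool);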
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-14 17:25 ` Junio C Hamano
  2025-11-14 21:11   ` Junio C Hamano
@ 2025-11-17  6:56   ` Junio C Hamano
  2025-11-17 22:09     ` brian m. carlson
  1 sibling, 1 reply; 101+ messages in thread
From: Junio C Hamano @ 2025-11-17 6:56 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren

Junio C Hamano <gitster@pobox.com> writes:

> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
>> The new Rust files have adopted an approach that is slightly different
>> from some of our other files and placed a license notice at the top.
>> This is required because of DCO part (a): "I have the right to submit it
>> under the open source license indicated in the file". It also avoids
>> ambiguity if the file is copied into a separate location (such as an LLM
>> training corpus).
>
> You may be aware of them already, but just in case, I was looking at
> CI breakages ...

In addition to the "cargo clippy" warnings I reported earlier (and
attempted to fix) in a separate message, we have been seeing constant
failure of the "win+Meson build" job at GitHub Actions CI.

https://github.com/git/git/actions/runs/19414557042/job/55540901761#step:6:848

I attempted to build tonight's 'seen' without this topic and the
failure seemed to stop.

https://github.com/git/git/actions/runs/19418361570/job/55551045554

This topic may need a bit of help from those who are clueful with
Rust and Windows.

Thanks.

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-11-17 6:56 ` Junio C Hamano @ 2025-11-17 22:09 ` brian m. carlson 2025-11-18 0:13 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:09 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 939 bytes --] On 2025-11-17 at 06:56:07, Junio C Hamano wrote: > In addition to "cargo clippy" I reported earlier (and attempted to > fix) in a separate message, we have been seeing constant failure of > "win+Meson build" job at GitHub Actions CI. > > https://github.com/git/git/actions/runs/19414557042/job/55540901761#step:6:848 > > I attempted to build tonight's 'seen' without this topic and it > seemed to stop. > > https://github.com/git/git/actions/runs/19418361570/job/55551045554 > > This topic may need a bit of help from those who are clueful with > Rust and Windows. I think that has been failing with Rust since well before my code came in. It has failed for me for a long time (well over a month), so I have just ignored it. I'm going to send v2 shortly, but we can squash in changes and do a v3 if there is something actually broken in this series. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-17 22:09 ` brian m. carlson
@ 2025-11-18  0:13   ` Junio C Hamano
  2025-11-19 23:04     ` brian m. carlson
  0 siblings, 1 reply; 101+ messages in thread
From: Junio C Hamano @ 2025-11-18 0:13 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On 2025-11-17 at 06:56:07, Junio C Hamano wrote:
>> In addition to the "cargo clippy" warnings I reported earlier (and
>> attempted to fix) in a separate message, we have been seeing constant
>> failure of the "win+Meson build" job at GitHub Actions CI.
>>
>> https://github.com/git/git/actions/runs/19414557042/job/55540901761#step:6:848
>>
>> I attempted to build tonight's 'seen' without this topic and the
>> failure seemed to stop.
>>
>> https://github.com/git/git/actions/runs/19418361570/job/55551045554
>>
>> This topic may need a bit of help from those who are clueful with
>> Rust and Windows.
>
> I think that has been failing with Rust since well before my code came
> in. It has failed for me for a long time (well over a month), so I have
> just ignored it.
>
> I'm going to send v2 shortly, but we can squash in changes and do a v3
> if there is something actually broken in this series.

Thanks.

    $ git log --oneline --first-parent -4 seen
    3f252ac9fe Merge branch 'ar/run-command-hook' into seen
    672cb7c62e ### CI
    3af201233b Merge branch 'bc/sha1-256-interop-02' into seen
    950efaac03 Merge branch 'cc/fast-import-strip-if-invalid' into seen

It seems that 672cb7c62e (which is an empty commit on top of the
merge of v2 of this series) fails win+Meson

https://github.com/git/git/actions/runs/19447841443/job/55646336507#step:6:689

but 950efaac03 (which is the merge before v2 of this series is
merged to 'seen') is happy with it.

https://github.com/git/git/actions/runs/19448271167/job/55647611566

These two runs roughly correspond to the with=bad/without=good pair
in the message you are responding to, but with the v1 of this series.

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-11-18 0:13 ` Junio C Hamano @ 2025-11-19 23:04 ` brian m. carlson 2025-11-19 23:24 ` Junio C Hamano 2025-11-19 23:37 ` Ezekiel Newren 0 siblings, 2 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-19 23:04 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 1222 bytes --] On 2025-11-18 at 00:13:40, Junio C Hamano wrote: > Thanks. > > $ git log --oneline --first-parent -4 seen > 3f252ac9fe Merge branch 'ar/run-command-hook' into seen > 672cb7c62e ### CI > 3af201233b Merge branch 'bc/sha1-256-interop-02' into seen > 950efaac03 Merge branch 'cc/fast-import-strip-if-invalid' into seen > > It seems that 672cb7c62e (which is an empty commit on top of the > merge of v2 of this series) fails win+Meson > > https://github.com/git/git/actions/runs/19447841443/job/55646336507#step:6:689 > > but 950efaac03 (which is the merge before v2 of this series is > merged to 'seen') is happy with it. > > https://github.com/git/git/actions/runs/19448271167/job/55647611566 > > These two runs roughly corresponds to the with=bad/without=good pair > in the message you are reponding to, but with the v1 of this series. Yes, I think we'll need someone familiar with Windows to take a look at that. The message doesn't indicate anything obvious and I don't have any Windows systems available to investigate. My guess is that it's something to do with the build.rs file, but I'm not certain. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-11-19 23:04 ` brian m. carlson @ 2025-11-19 23:24 ` Junio C Hamano 2025-11-19 23:37 ` Ezekiel Newren 1 sibling, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-11-19 23:24 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > On 2025-11-18 at 00:13:40, Junio C Hamano wrote: >> Thanks. >> >> $ git log --oneline --first-parent -4 seen >> 3f252ac9fe Merge branch 'ar/run-command-hook' into seen >> 672cb7c62e ### CI >> 3af201233b Merge branch 'bc/sha1-256-interop-02' into seen >> 950efaac03 Merge branch 'cc/fast-import-strip-if-invalid' into seen >> >> It seems that 672cb7c62e (which is an empty commit on top of the >> merge of v2 of this series) fails win+Meson >> >> https://github.com/git/git/actions/runs/19447841443/job/55646336507#step:6:689 >> >> but 950efaac03 (which is the merge before v2 of this series is >> merged to 'seen') is happy with it. >> >> https://github.com/git/git/actions/runs/19448271167/job/55647611566 >> >> These two runs roughly corresponds to the with=bad/without=good pair >> in the message you are reponding to, but with the v1 of this series. > > Yes, I think we'll need someone familiar with Windows to take a look at > that. The message doesn't indicate anything obvious and I don't have > any Windows systems available to investigate. > > My guess is that it's something to do with the build.rs file, but I'm > not certain. Today's pushout includes jk/ci-windows-meson-test-fix that restores the ability to show the failure log from win+Meson jobs, so we will hopefully see something a bit more usable than what we saw in the previous runs. Thanks. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-19 23:04     ` brian m. carlson
  2025-11-19 23:24       ` Junio C Hamano
@ 2025-11-19 23:37       ` Ezekiel Newren
  2025-11-20 19:52         ` Ezekiel Newren
  1 sibling, 1 reply; 101+ messages in thread
From: Ezekiel Newren @ 2025-11-19 23:37 UTC (permalink / raw)
  To: brian m. carlson, Junio C Hamano, git, Patrick Steinhardt, Ezekiel Newren

On Wed, Nov 19, 2025 at 4:04 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2025-11-18 at 00:13:40, Junio C Hamano wrote:
> > Thanks.
> >
> >     $ git log --oneline --first-parent -4 seen
> >     3f252ac9fe Merge branch 'ar/run-command-hook' into seen
> >     672cb7c62e ### CI
> >     3af201233b Merge branch 'bc/sha1-256-interop-02' into seen
> >     950efaac03 Merge branch 'cc/fast-import-strip-if-invalid' into seen
> >
> > It seems that 672cb7c62e (which is an empty commit on top of the
> > merge of v2 of this series) fails win+Meson
> >
> > https://github.com/git/git/actions/runs/19447841443/job/55646336507#step:6:689
> >
> > but 950efaac03 (which is the merge before v2 of this series is
> > merged to 'seen') is happy with it.
> >
> > https://github.com/git/git/actions/runs/19448271167/job/55647611566
> >
> > These two runs roughly correspond to the with=bad/without=good pair
> > in the message you are responding to, but with the v1 of this series.
>
> Yes, I think we'll need someone familiar with Windows to take a look at
> that. The message doesn't indicate anything obvious and I don't have
> any Windows systems available to investigate.
>
> My guess is that it's something to do with the build.rs file, but I'm
> not certain.

This was a known issue that I pointed out before Patrick's
"Introduce Rust" series was merged [1].

[1] https://lore.kernel.org/git/CAH=ZcbBjL09Mk3AXBSgmZGvmFtU3Roc2P5rbQsZ-U5DBHYSs7w@mail.gmail.com/

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-19 23:37       ` Ezekiel Newren
@ 2025-11-20 19:52         ` Ezekiel Newren
  2025-11-20 23:02           ` brian m. carlson
  0 siblings, 1 reply; 101+ messages in thread
From: Ezekiel Newren @ 2025-11-20 19:52 UTC (permalink / raw)
  To: brian m. carlson, Junio C Hamano, git, Patrick Steinhardt, Ezekiel Newren

On Wed, Nov 19, 2025 at 4:37 PM Ezekiel Newren <ezekielnewren@gmail.com> wrote:
>
> On Wed, Nov 19, 2025 at 4:04 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> >
> > On 2025-11-18 at 00:13:40, Junio C Hamano wrote:
> > > Thanks.
> > >
> > >     $ git log --oneline --first-parent -4 seen
> > >     3f252ac9fe Merge branch 'ar/run-command-hook' into seen
> > >     672cb7c62e ### CI
> > >     3af201233b Merge branch 'bc/sha1-256-interop-02' into seen
> > >     950efaac03 Merge branch 'cc/fast-import-strip-if-invalid' into seen
> > >
> > > It seems that 672cb7c62e (which is an empty commit on top of the
> > > merge of v2 of this series) fails win+Meson
> > >
> > > https://github.com/git/git/actions/runs/19447841443/job/55646336507#step:6:689
> > >
> > > but 950efaac03 (which is the merge before v2 of this series is
> > > merged to 'seen') is happy with it.
> > >
> > > https://github.com/git/git/actions/runs/19448271167/job/55647611566
> > >
> > > These two runs roughly correspond to the with=bad/without=good pair
> > > in the message you are responding to, but with the v1 of this series.
> >
> > Yes, I think we'll need someone familiar with Windows to take a look at
> > that. The message doesn't indicate anything obvious and I don't have
> > any Windows systems available to investigate.
> >
> > My guess is that it's something to do with the build.rs file, but I'm
> > not certain.
>
> This was a known issue that I pointed out before Patrick's
> "Introduce Rust" series was merged [1].
>
> [1] https://lore.kernel.org/git/CAH=ZcbBjL09Mk3AXBSgmZGvmFtU3Roc2P5rbQsZ-U5DBHYSs7w@mail.gmail.com/

Check out my retrospective review [1]. Basically, if windows + msvc ->
<crate>.lib, else lib<crate>.a, but it was coded as just
if windows -> ...

In the GitHub CI these are the only Windows combos that are tested.

    "win build"       is windows + gnu  + Makefile
    "win+Meson build" is windows + msvc + Meson

[1] ci windows problems
https://lore.kernel.org/git/CAH=ZcbB8cRgCTp-Q_CxJ4VFNY1+w+C20zgx9bMre4-hNmPrD7g@mail.gmail.com/

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-20 19:52         ` Ezekiel Newren
@ 2025-11-20 23:02           ` brian m. carlson
  2025-11-20 23:11             ` Ezekiel Newren
  0 siblings, 1 reply; 101+ messages in thread
From: brian m. carlson @ 2025-11-20 23:02 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: Junio C Hamano, git, Patrick Steinhardt

[-- Attachment #1: Type: text/plain, Size: 1636 bytes --]

On 2025-11-20 at 19:52:23, Ezekiel Newren wrote:
> Check out my retrospective review [1]. Basically, if windows + msvc ->
> <crate>.lib, else lib<crate>.a, but it was coded as just
> if windows -> ...
>
> In the GitHub CI these are the only Windows combos that are tested.
>
>     "win build"       is windows + gnu  + Makefile
>     "win+Meson build" is windows + msvc + Meson

So I don't think that fixes the build[0]; this is the patch I tried:

-- %< --
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: "brian m. carlson" <sandals@crustytoothpaste.net>
Date: Thu, 20 Nov 2025 22:52:37 +0000
Subject: [PATCH] WIP: try fixing CI

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile           | 2 +-
 src/cargo-meson.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index b05709c5e9..8bdb05e535 100644
--- a/Makefile
+++ b/Makefile
@@ -934,7 +934,7 @@ else
 RUST_TARGET_DIR = target/release
 endif
 
-ifeq ($(uname_S),Windows)
+ifdef MSVC
 RUST_LIB = $(RUST_TARGET_DIR)/gitcore.lib
 else
 RUST_LIB = $(RUST_TARGET_DIR)/libgitcore.a
diff --git a/src/cargo-meson.sh b/src/cargo-meson.sh
index 3998db0435..80c10b22cf 100755
--- a/src/cargo-meson.sh
+++ b/src/cargo-meson.sh
@@ -27,7 +27,7 @@ then
 fi
 
 case "$(cargo -vV | sed -s 's/^host: \(.*\)$/\1/')" in
-	*-windows-*)
+	*-windows-msvc*)
 		LIBNAME=gitcore.lib;;
 	*)
 		LIBNAME=libgitcore.a;;
-- 
2.51.0.338.gd7d06c2dae8
-- %< --

[0] https://github.com/bk2204/git/actions/runs/19553883891/job/55991786359
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-20 23:02           ` brian m. carlson
@ 2025-11-20 23:11             ` Ezekiel Newren
  2025-11-20 23:14               ` Junio C Hamano
  0 siblings, 1 reply; 101+ messages in thread
From: Ezekiel Newren @ 2025-11-20 23:11 UTC (permalink / raw)
  To: brian m. carlson, Ezekiel Newren, Junio C Hamano, git, Patrick Steinhardt

On Thu, Nov 20, 2025 at 4:03 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2025-11-20 at 19:52:23, Ezekiel Newren wrote:
> > Check out my retrospective review [1]. Basically, if windows + msvc ->
> > <crate>.lib, else lib<crate>.a, but it was coded as just
> > if windows -> ...
> >
> > In the GitHub CI these are the only Windows combos that are tested.
> >
> >     "win build"       is windows + gnu  + Makefile
> >     "win+Meson build" is windows + msvc + Meson
>
> So I don't think that fixes the build[0]; this is the patch I tried:

You are correct. It's only part of the whole solution. I'm working on
ironing out all of the GitHub CI problems in my cargo-workspace patch
series (not yet released). Once I've figured that out, I'll publish my
series on the mailing list.

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-11-20 23:11 ` Ezekiel Newren @ 2025-11-20 23:14 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-11-20 23:14 UTC (permalink / raw) To: Ezekiel Newren; +Cc: brian m. carlson, git, Patrick Steinhardt Ezekiel Newren <ezekielnewren@gmail.com> writes: > On Thu, Nov 20, 2025 at 4:03 PM brian m. carlson > <sandals@crustytoothpaste.net> wrote: >> >> On 2025-11-20 at 19:52:23, Ezekiel Newren wrote: >> > Checkout my retrospective review [1]. Basically if windows + msvc -> >> > <crate>.lib else lib<crate>.a, but it was coded as just if windows -> >> > ... >> > >> > In the github ci these are the only windows combos that are tested. >> > "win build" is windows + gnu + Makefile >> > "win+Meson build" windows + msvc + Meson >> >> So I don't think that fixes the build[0] with this patch: > > You are correct. It's part of the whole solution. I'm working on > ironing out all github ci problems in my cargo-workspace patch series > (not yet released). Once I've figured that out I'll publish my series > on the mailing list. Thanks for working well together. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH v2 00/15] SHA-1/SHA-256 interoperability, part 2
  2025-10-27  0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson
                   ` (16 preceding siblings ...)
  2025-11-14 17:25 ` Junio C Hamano
@ 2025-11-17 22:16 ` brian m. carlson
  2025-11-17 22:16   ` [PATCH v2 01/15] repository: require Rust support for interoperability brian m. carlson
                   ` (14 more replies)
  17 siblings, 15 replies; 101+ messages in thread
From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

This is the second part of the SHA-1/SHA-256 interoperability work. It
introduces our first major use of Rust code to implement an object map
format as well as preparatory work to make that happen, including
changing types to more Rust-friendly ones. Since Rust will be required
for the interoperability work, we require that in the testsuite.

We also verify that our object ID algorithm is valid when looking up
data in the hash map since the Rust code intentionally has no knowledge
about global mutable state like the_repository and so cannot default to
the main hash algorithm when we've zero-initialized a struct object_id.

The advantage of this Rust code is that it is comprehensively tested
with unit tests. We can serialize our object map and then verify that
we can also load it again and perform various tests, such as whether
certain object IDs are found in the map and mapped correctly. We can
also test our slightly subtle custom binary search code effectively and
be confident that it works, since Rust doesn't provide a way to binary
search slices of variable length.

I have opted not to use an enum type for our hash algorithm and have
preserved the use of uint32_t from v1. A C enum type would not map
one-to-one with the Rust type (since the C version would use
GIT_HASH_UNKNOWN for unknown values and Rust would use None instead),
so to avoid problems as we generate more of the integration code with
bindgen and cbindgen, I've chosen to leave it as it is.

Changes since v1:
* Use `MAYBE_UNUSED` instead of casting.
* Explain reason for `ObjectID` structure.
* Switch to `Result` in hash algorithm abstraction.
* Add some additional helpers to `ObjectID`.
* Rename function to `hash_algo_ptr_by_number`.
* Switch to `xmalloc`.
* Fix `build.rs` to use syntax compatible with Rust 1.63.
* Remove unneeded libraries from `build.rs`.
* Improve Rust documentation.
* Explain that safe hashing is about untrusted data, not memory safety.
* Add a trait for hashing to allow for future unsafe (trusted data)
  hashing.
* Rename `Hasher` to `CryptoHasher`.
* Remove description of legacy loose object map.
* Rename loose object map to object map.
* Update documentation for object map to be clearer about padding,
  alignment, and endianness.
* Explain which hash algorithm is used in object map.
* Remove mention of chunks in object map in favour of generic
  "additional data".
* Fix indentation in object map documentation.
* Generally clarify object map documentation.
* Fix clippy warnings in Rust code.

brian m.
carlson (15): repository: require Rust support for interoperability conversion: don't crash when no destination algo hash: use uint32_t for object_id algorithm rust: add a ObjectID struct rust: add a hash algorithm abstraction hash: add a function to look up hash algo structs rust: add additional helpers for ObjectID csum-file: define hashwrite's count as a uint32_t write-or-die: add an fsync component for the object map hash: expose hash context functions to Rust rust: add a build.rs script for tests rust: add functionality to hash an object rust: add a new binary object map format rust: add a small wrapper around the hashfile code object-file-convert: always make sure object ID algo is valid Documentation/gitformat-loose.adoc | 78 +++ Makefile | 5 +- build.rs | 17 + csum-file.c | 2 +- csum-file.h | 2 +- hash.c | 48 +- hash.h | 38 +- object-file-convert.c | 14 +- oidtree.c | 2 +- repository.c | 12 +- repository.h | 4 +- serve.c | 2 +- src/csum_file.rs | 81 +++ src/hash.rs | 466 +++++++++++++++ src/lib.rs | 3 + src/loose.rs | 913 +++++++++++++++++++++++++++++ src/meson.build | 3 + t/t1006-cat-file.sh | 82 ++- t/t1016-compatObjectFormat.sh | 6 + t/t1500-rev-parse.sh | 2 +- t/t9305-fast-import-signatures.sh | 4 +- t/t9350-fast-export.sh | 4 +- t/test-lib.sh | 4 + write-or-die.h | 4 +- 24 files changed, 1722 insertions(+), 74 deletions(-) create mode 100644 build.rs create mode 100644 src/csum_file.rs create mode 100644 src/hash.rs create mode 100644 src/loose.rs ^ permalink raw reply [flat|nested] 101+ messages in thread
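Of the listed changes, the hashing-trait split is perhaps the least
obvious; it might look something like this (purely a sketch: only the
names Hasher and CryptoHasher appear in the cover letter, and the
trait name and signatures here are assumed):

    /// Incremental hashing over a byte stream.
    pub trait GitHasher {
        fn update(&mut self, data: &[u8]);
        fn finalize(self) -> Vec<u8>;
    }

    /// Collision-detecting hashing, safe to feed untrusted input.
    pub struct CryptoHasher { /* state elided */ }

    impl GitHasher for CryptoHasher {
        fn update(&mut self, _data: &[u8]) { /* ... */ }
        fn finalize(self) -> Vec<u8> {
            Vec::new() // digest elided in this sketch
        }
    }

    // A faster hasher without collision detection could implement the
    // same trait later for trusted data, without touching call sites.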
* [PATCH v2 01/15] repository: require Rust support for interoperability
  2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson
@ 2025-11-17 22:16   ` brian m. carlson
  2025-11-17 22:16   ` [PATCH v2 02/15] conversion: don't crash when no destination algo brian m. carlson
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 101+ messages in thread
From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

We'll be implementing some of our interoperability code, like the loose
object map, in Rust. While the code currently compiles with the old
loose object map format, which is written entirely in C, we'll soon
replace that with the Rust-based implementation. Require the use of
Rust for compatibility mode and die if it is not supported. Because the
repo argument is not used when Rust is missing, mark it MAYBE_UNUSED to
silence the compiler warning, which we do not care about.

Add a prerequisite in our tests, RUST, that checks if Rust
functionality is available and use it in the tests that handle
interoperability.

This is technically a regression in functionality compared to our
existing state, but pack index v3 is not yet implemented and thus the
functionality is mostly quite broken, which is why we've recently
marked this functionality as experimental. We don't believe anyone is
getting real use out of the interoperability code in its current state,
so no actual users should be negatively impacted by this change.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 repository.c                      |  8 ++-
 t/t1006-cat-file.sh               | 82 +++++++++++++++++++++----------
 t/t1016-compatObjectFormat.sh     |  6 +++
 t/t1500-rev-parse.sh              |  2 +-
 t/t9305-fast-import-signatures.sh |  4 +-
 t/t9350-fast-export.sh            |  4 +-
 t/test-lib.sh                     |  4 ++
 7 files changed, 77 insertions(+), 33 deletions(-)

diff --git a/repository.c b/repository.c
index 6faf5c7398..186d2c1028 100644
--- a/repository.c
+++ b/repository.c
@@ -3,6 +3,7 @@
 #include "repository.h"
 #include "odb.h"
 #include "config.h"
+#include "gettext.h"
 #include "object.h"
 #include "lockfile.h"
 #include "path.h"
@@ -190,13 +191,18 @@ void repo_set_hash_algo(struct repository *repo, int hash_algo)
         repo->hash_algo = &hash_algos[hash_algo];
 }
 
-void repo_set_compat_hash_algo(struct repository *repo, int algo)
+void repo_set_compat_hash_algo(struct repository *repo MAYBE_UNUSED, int algo)
 {
+#ifdef WITH_RUST
         if (hash_algo_by_ptr(repo->hash_algo) == algo)
                 BUG("hash_algo and compat_hash_algo match");
         repo->compat_hash_algo = algo ?
&hash_algos[algo] : NULL; if (repo->compat_hash_algo) repo_read_loose_object_map(repo); +#else + if (algo) + die(_("compatibility hash algorithm support requires Rust")); +#endif } void repo_set_ref_storage_format(struct repository *repo, diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh index 1f61b666a7..29a9503523 100755 --- a/t/t1006-cat-file.sh +++ b/t/t1006-cat-file.sh @@ -241,10 +241,16 @@ hello_content="Hello World" hello_size=$(strlen "$hello_content") hello_oid=$(echo_without_newline "$hello_content" | git hash-object --stdin) -test_expect_success "setup" ' +test_expect_success "setup part 1" ' git config core.repositoryformatversion 1 && - git config extensions.objectformat $test_hash_algo && - git config extensions.compatobjectformat $test_compat_hash_algo && + git config extensions.objectformat $test_hash_algo +' + +test_expect_success RUST 'compat setup' ' + git config extensions.compatobjectformat $test_compat_hash_algo +' + +test_expect_success 'setup part 2' ' echo_without_newline "$hello_content" > hello && git update-index --add hello && echo_without_newline "$hello_content" > "path with spaces" && @@ -273,9 +279,13 @@ run_blob_tests () { ' } -hello_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $hello_oid) run_blob_tests $hello_oid -run_blob_tests $hello_compat_oid + +if test_have_prereq RUST +then + hello_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $hello_oid) + run_blob_tests $hello_compat_oid +fi test_expect_success '--batch-check without %(rest) considers whole line' ' echo "$hello_oid blob $hello_size" >expect && @@ -286,62 +296,76 @@ test_expect_success '--batch-check without %(rest) considers whole line' ' ' tree_oid=$(git write-tree) -tree_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tree_oid) tree_size=$((2 * $(test_oid rawsz) + 13 + 24)) -tree_compat_size=$((2 * $(test_oid --hash=compat rawsz) + 13 + 24)) tree_pretty_content="100644 blob $hello_oid hello${LF}100755 blob $hello_oid path with spaces${LF}" -tree_compat_pretty_content="100644 blob $hello_compat_oid hello${LF}100755 blob $hello_compat_oid path with spaces${LF}" run_tests 'tree' $tree_oid "" $tree_size "" "$tree_pretty_content" -run_tests 'tree' $tree_compat_oid "" $tree_compat_size "" "$tree_compat_pretty_content" run_tests 'blob' "$tree_oid:hello" "100644" $hello_size "" "$hello_content" $hello_oid -run_tests 'blob' "$tree_compat_oid:hello" "100644" $hello_size "" "$hello_content" $hello_compat_oid run_tests 'blob' "$tree_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_oid -run_tests 'blob' "$tree_compat_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_compat_oid + +if test_have_prereq RUST +then + tree_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tree_oid) + tree_compat_size=$((2 * $(test_oid --hash=compat rawsz) + 13 + 24)) + tree_compat_pretty_content="100644 blob $hello_compat_oid hello${LF}100755 blob $hello_compat_oid path with spaces${LF}" + + run_tests 'tree' $tree_compat_oid "" $tree_compat_size "" "$tree_compat_pretty_content" + run_tests 'blob' "$tree_compat_oid:hello" "100644" $hello_size "" "$hello_content" $hello_compat_oid + run_tests 'blob' "$tree_compat_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_compat_oid +fi commit_message="Initial commit" commit_oid=$(echo_without_newline "$commit_message" | git commit-tree $tree_oid) -commit_compat_oid=$(git rev-parse 
--output-object-format=$test_compat_hash_algo $commit_oid) commit_size=$(($(test_oid hexsz) + 137)) -commit_compat_size=$(($(test_oid --hash=compat hexsz) + 137)) commit_content="tree $tree_oid author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> $GIT_AUTHOR_DATE committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE $commit_message" -commit_compat_content="tree $tree_compat_oid +run_tests 'commit' $commit_oid "" $commit_size "$commit_content" "$commit_content" + +if test_have_prereq RUST +then + commit_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $commit_oid) + commit_compat_size=$(($(test_oid --hash=compat hexsz) + 137)) + commit_compat_content="tree $tree_compat_oid author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> $GIT_AUTHOR_DATE committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE $commit_message" -run_tests 'commit' $commit_oid "" $commit_size "$commit_content" "$commit_content" -run_tests 'commit' $commit_compat_oid "" $commit_compat_size "$commit_compat_content" "$commit_compat_content" + run_tests 'commit' $commit_compat_oid "" $commit_compat_size "$commit_compat_content" "$commit_compat_content" +fi tag_header_without_oid="type blob tag hellotag tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL>" tag_header_without_timestamp="object $hello_oid $tag_header_without_oid" -tag_compat_header_without_timestamp="object $hello_compat_oid -$tag_header_without_oid" tag_description="This is a tag" tag_content="$tag_header_without_timestamp 0 +0000 -$tag_description" -tag_compat_content="$tag_compat_header_without_timestamp 0 +0000 - $tag_description" tag_oid=$(echo_without_newline "$tag_content" | git hash-object -t tag --stdin -w) tag_size=$(strlen "$tag_content") -tag_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tag_oid) -tag_compat_size=$(strlen "$tag_compat_content") - run_tests 'tag' $tag_oid "" $tag_size "$tag_content" "$tag_content" -run_tests 'tag' $tag_compat_oid "" $tag_compat_size "$tag_compat_content" "$tag_compat_content" + +if test_have_prereq RUST +then + tag_compat_header_without_timestamp="object $hello_compat_oid +$tag_header_without_oid" + tag_compat_content="$tag_compat_header_without_timestamp 0 +0000 + +$tag_description" + + tag_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tag_oid) + tag_compat_size=$(strlen "$tag_compat_content") + + run_tests 'tag' $tag_compat_oid "" $tag_compat_size "$tag_compat_content" "$tag_compat_content" +fi test_expect_success "Reach a blob from a tag pointing to it" ' echo_without_newline "$hello_content" >expect && @@ -590,7 +614,8 @@ flush" } batch_tests $hello_oid $tree_oid $tree_size $commit_oid $commit_size "$commit_content" $tag_oid $tag_size "$tag_content" -batch_tests $hello_compat_oid $tree_compat_oid $tree_compat_size $commit_compat_oid $commit_compat_size "$commit_compat_content" $tag_compat_oid $tag_compat_size "$tag_compat_content" + +test_have_prereq RUST && batch_tests $hello_compat_oid $tree_compat_oid $tree_compat_size $commit_compat_oid $commit_compat_size "$commit_compat_content" $tag_compat_oid $tag_compat_size "$tag_compat_content" test_expect_success FUNNYNAMES 'setup with newline in input' ' @@ -1226,7 +1251,10 @@ test_expect_success 'batch-check with a submodule' ' test_unconfig extensions.compatobjectformat && printf "160000 commit $(test_oid deadbeef)\tsub\n" >tree-with-sub && tree=$(git mktree <tree-with-sub) && - test_config extensions.compatobjectformat $test_compat_hash_algo && + if test_have_prereq RUST + 
then + test_config extensions.compatobjectformat $test_compat_hash_algo + fi && git cat-file --batch-check >actual <<-EOF && $tree:sub diff --git a/t/t1016-compatObjectFormat.sh b/t/t1016-compatObjectFormat.sh index 0efce53f3a..92d48b96a1 100755 --- a/t/t1016-compatObjectFormat.sh +++ b/t/t1016-compatObjectFormat.sh @@ -8,6 +8,12 @@ test_description='Test how well compatObjectFormat works' . ./test-lib.sh . "$TEST_DIRECTORY"/lib-gpg.sh +if ! test_have_prereq RUST +then + skip_all='interoperability requires a Git built with Rust' + test_done +fi + # All of the follow variables must be defined in the environment: # GIT_AUTHOR_NAME # GIT_AUTHOR_EMAIL diff --git a/t/t1500-rev-parse.sh b/t/t1500-rev-parse.sh index 7739ab611b..98c5a772bd 100755 --- a/t/t1500-rev-parse.sh +++ b/t/t1500-rev-parse.sh @@ -208,7 +208,7 @@ test_expect_success 'rev-parse --show-object-format in repo' ' ' -test_expect_success 'rev-parse --show-object-format in repo with compat mode' ' +test_expect_success RUST 'rev-parse --show-object-format in repo with compat mode' ' mkdir repo && ( sane_unset GIT_DEFAULT_HASH && diff --git a/t/t9305-fast-import-signatures.sh b/t/t9305-fast-import-signatures.sh index c2b4271658..63c0a2b5c4 100755 --- a/t/t9305-fast-import-signatures.sh +++ b/t/t9305-fast-import-signatures.sh @@ -70,7 +70,7 @@ test_expect_success GPGSSH 'strip SSH signature with --signed-commits=strip' ' test_must_be_empty log ' -test_expect_success GPG 'setup a commit with dual OpenPGP signatures on its SHA-1 and SHA-256 formats' ' +test_expect_success RUST,GPG 'setup a commit with dual OpenPGP signatures on its SHA-1 and SHA-256 formats' ' # Create a signed SHA-256 commit git init --object-format=sha256 explicit-sha256 && git -C explicit-sha256 config extensions.compatObjectFormat sha1 && @@ -91,7 +91,7 @@ test_expect_success GPG 'setup a commit with dual OpenPGP signatures on its SHA- test_grep -E "^gpgsig-sha256 " out ' -test_expect_success GPG 'strip both OpenPGP signatures with --signed-commits=warn-strip' ' +test_expect_success RUST,GPG 'strip both OpenPGP signatures with --signed-commits=warn-strip' ' git -C explicit-sha256 fast-export --signed-commits=verbatim dual-signed >output && test_grep -E "^gpgsig sha1 openpgp" output && test_grep -E "^gpgsig sha256 openpgp" output && diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh index 3d153a4805..784d68b6e5 100755 --- a/t/t9350-fast-export.sh +++ b/t/t9350-fast-export.sh @@ -972,7 +972,7 @@ test_expect_success 'fast-export handles --end-of-options' ' test_cmp expect actual ' -test_expect_success GPG 'setup a commit with dual signatures on its SHA-1 and SHA-256 formats' ' +test_expect_success GPG,RUST 'setup a commit with dual signatures on its SHA-1 and SHA-256 formats' ' # Create a signed SHA-256 commit git init --object-format=sha256 explicit-sha256 && git -C explicit-sha256 config extensions.compatObjectFormat sha1 && @@ -993,7 +993,7 @@ test_expect_success GPG 'setup a commit with dual signatures on its SHA-1 and SH test_grep -E "^gpgsig-sha256 " out ' -test_expect_success GPG 'export and import of doubly signed commit' ' +test_expect_success GPG,RUST 'export and import of doubly signed commit' ' git -C explicit-sha256 fast-export --signed-commits=verbatim dual-signed >output && test_grep -E "^gpgsig sha1 openpgp" output && test_grep -E "^gpgsig sha256 openpgp" output && diff --git a/t/test-lib.sh b/t/test-lib.sh index ef0ab7ec2d..3499a83806 100644 --- a/t/test-lib.sh +++ b/t/test-lib.sh @@ -1890,6 +1890,10 @@ test_lazy_prereq LONG_IS_64BIT ' test 8 
-le "$(build_option sizeof-long)" ' +test_lazy_prereq RUST ' + test "$(build_option rust)" = enabled +' + test_lazy_prereq TIME_IS_64BIT 'test-tool date is64bit' test_lazy_prereq TIME_T_IS_64BIT 'test-tool date time_t-is64bit' ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 02/15] conversion: don't crash when no destination algo 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson 2025-11-17 22:16 ` [PATCH v2 01/15] repository: require Rust support for interoperability brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 03/15] hash: use uint32_t for object_id algorithm brian m. carlson ` (12 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren When we set up a repository that doesn't have a compatibility hash algorithm, we set the destination algorithm object to NULL. In such a case, we want to silently do nothing instead of crashing, so simply treat the operation as a no-op and copy the object ID. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- object-file-convert.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/object-file-convert.c b/object-file-convert.c index 7ab875afe6..e44c821084 100644 --- a/object-file-convert.c +++ b/object-file-convert.c @@ -23,7 +23,7 @@ int repo_oid_to_algop(struct repository *repo, const struct object_id *src, const struct git_hash_algo *from = src->algo ? &hash_algos[src->algo] : repo->hash_algo; - if (from == to) { + if (from == to || !to) { if (src != dest) oidcpy(dest, src); return 0; ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 03/15] hash: use uint32_t for object_id algorithm 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson 2025-11-17 22:16 ` [PATCH v2 01/15] repository: require Rust support for interoperability brian m. carlson 2025-11-17 22:16 ` [PATCH v2 02/15] conversion: don't crash when no destination algo brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 04/15] rust: add a ObjectID struct brian m. carlson ` (11 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We currently use an int for this value, but we'll define this structure from Rust in a future commit and we want to ensure that our data types are exactly identical. To make that possible, use a uint32_t for the hash algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- hash.c | 6 +++--- hash.h | 10 +++++----- oidtree.c | 2 +- repository.c | 6 +++--- repository.h | 4 ++-- serve.c | 2 +- 6 files changed, 15 insertions(+), 15 deletions(-) diff --git a/hash.c b/hash.c index 4a04ecb50e..81b4f87027 100644 --- a/hash.c +++ b/hash.c @@ -241,7 +241,7 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop) return oid_to_hex_r(buf, algop->empty_tree); } -int hash_algo_by_name(const char *name) +uint32_t hash_algo_by_name(const char *name) { if (!name) return GIT_HASH_UNKNOWN; @@ -251,7 +251,7 @@ int hash_algo_by_name(const char *name) return GIT_HASH_UNKNOWN; } -int hash_algo_by_id(uint32_t format_id) +uint32_t hash_algo_by_id(uint32_t format_id) { for (size_t i = 1; i < GIT_HASH_NALGOS; i++) if (format_id == hash_algos[i].format_id) @@ -259,7 +259,7 @@ int hash_algo_by_id(uint32_t format_id) return GIT_HASH_UNKNOWN; } -int hash_algo_by_length(size_t len) +uint32_t hash_algo_by_length(size_t len) { for (size_t i = 1; i < GIT_HASH_NALGOS; i++) if (len == hash_algos[i].rawsz) diff --git a/hash.h b/hash.h index fae966b23c..99c9c2a0a8 100644 --- a/hash.h +++ b/hash.h @@ -211,7 +211,7 @@ static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *s struct object_id { unsigned char hash[GIT_MAX_RAWSZ]; - int algo; /* XXX requires 4-byte alignment */ + uint32_t algo; /* XXX requires 4-byte alignment */ }; #define GET_OID_QUIETLY 01 @@ -344,13 +344,13 @@ static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if * the name doesn't match a known algorithm. */ -int hash_algo_by_name(const char *name); +uint32_t hash_algo_by_name(const char *name); /* Identical, except based on the format ID. */ -int hash_algo_by_id(uint32_t format_id); +uint32_t hash_algo_by_id(uint32_t format_id); /* Identical, except based on the length. */ -int hash_algo_by_length(size_t len); +uint32_t hash_algo_by_length(size_t len); /* Identical, except for a pointer to struct git_hash_algo. 
*/ -static inline int hash_algo_by_ptr(const struct git_hash_algo *p) +static inline uint32_t hash_algo_by_ptr(const struct git_hash_algo *p) { size_t i; for (i = 0; i < GIT_HASH_NALGOS; i++) { diff --git a/oidtree.c b/oidtree.c index 151568f74f..324de94934 100644 --- a/oidtree.c +++ b/oidtree.c @@ -10,7 +10,7 @@ struct oidtree_iter_data { oidtree_iter fn; void *arg; size_t *last_nibble_at; - int algo; + uint32_t algo; uint8_t last_byte; }; diff --git a/repository.c b/repository.c index 186d2c1028..ebe719de3c 100644 --- a/repository.c +++ b/repository.c @@ -39,7 +39,7 @@ struct repository *the_repository = &the_repo; static void set_default_hash_algo(struct repository *repo) { const char *hash_name; - int algo; + uint32_t algo; hash_name = getenv("GIT_TEST_DEFAULT_HASH_ALGO"); if (!hash_name) @@ -186,12 +186,12 @@ void repo_set_gitdir(struct repository *repo, repo->gitdir, "index"); } -void repo_set_hash_algo(struct repository *repo, int hash_algo) +void repo_set_hash_algo(struct repository *repo, uint32_t hash_algo) { repo->hash_algo = &hash_algos[hash_algo]; } -void repo_set_compat_hash_algo(struct repository *repo MAYBE_UNUSED, int algo) +void repo_set_compat_hash_algo(struct repository *repo MAYBE_UNUSED, uint32_t algo) { #ifdef WITH_RUST if (hash_algo_by_ptr(repo->hash_algo) == algo) diff --git a/repository.h b/repository.h index 5808a5d610..c0a3543b24 100644 --- a/repository.h +++ b/repository.h @@ -193,8 +193,8 @@ struct set_gitdir_args { void repo_set_gitdir(struct repository *repo, const char *root, const struct set_gitdir_args *extra_args); void repo_set_worktree(struct repository *repo, const char *path); -void repo_set_hash_algo(struct repository *repo, int algo); -void repo_set_compat_hash_algo(struct repository *repo, int compat_algo); +void repo_set_hash_algo(struct repository *repo, uint32_t algo); +void repo_set_compat_hash_algo(struct repository *repo, uint32_t compat_algo); void repo_set_ref_storage_format(struct repository *repo, enum ref_storage_format format); void initialize_repository(struct repository *repo); diff --git a/serve.c b/serve.c index 53ecab3b42..49a6e39b1d 100644 --- a/serve.c +++ b/serve.c @@ -14,7 +14,7 @@ static int advertise_sid = -1; static int advertise_object_info = -1; -static int client_hash_algo = GIT_HASH_SHA1_LEGACY; +static uint32_t client_hash_algo = GIT_HASH_SHA1_LEGACY; static int always_advertise(struct repository *r UNUSED, struct strbuf *value UNUSED) ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 04/15] rust: add a ObjectID struct 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (2 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 03/15] hash: use uint32_t for object_id algorithm brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 05/15] rust: add a hash algorithm abstraction brian m. carlson ` (10 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We'd like to be able to write some Rust code that can work with object IDs. Add a structure here that's identical to struct object_id in C, for easy use in sharing across the FFI boundary. We will use this structure in several places in hot paths, such as index-pack or pack-objects when converting between algorithms, so prioritize efficient interchange over a more idiomatic Rust approach. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Makefile | 1 + src/hash.rs | 21 +++++++++++++++++++++ src/lib.rs | 1 + src/meson.build | 1 + 4 files changed, 24 insertions(+) create mode 100644 src/hash.rs diff --git a/Makefile b/Makefile index 7e0f77e298..e1d0ae3691 100644 --- a/Makefile +++ b/Makefile @@ -1534,6 +1534,7 @@ CLAR_TEST_OBJS += $(UNIT_TEST_DIR)/unit-test.o UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o +RUST_SOURCES += src/hash.rs RUST_SOURCES += src/lib.rs RUST_SOURCES += src/varint.rs diff --git a/src/hash.rs b/src/hash.rs new file mode 100644 index 0000000000..0219391820 --- /dev/null +++ b/src/hash.rs @@ -0,0 +1,21 @@ +// This program is free software; you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation: version 2 of the License, dated June 1991. +// +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License along +// with this program; if not, see <https://www.gnu.org/licenses/>. + +pub const GIT_MAX_RAWSZ: usize = 32; + +/// A binary object ID. +#[repr(C)] +#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)] +pub struct ObjectID { + pub hash: [u8; GIT_MAX_RAWSZ], + pub algo: u32, +} diff --git a/src/lib.rs b/src/lib.rs index 9da70d8b57..cf7c962509 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1 +1,2 @@ +pub mod hash; pub mod varint; diff --git a/src/meson.build b/src/meson.build index 25b9ad5a14..c77041a3fa 100644 --- a/src/meson.build +++ b/src/meson.build @@ -1,4 +1,5 @@ libgit_rs_sources = [ + 'hash.rs', 'lib.rs', 'varint.rs', ] ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 05/15] rust: add a hash algorithm abstraction 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (3 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 04/15] rust: add a ObjectID struct brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 06/15] hash: add a function to look up hash algo structs brian m. carlson ` (9 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren This works very similarly to the existing one in C except that it doesn't provide any functionality to hash an object. We don't currently need that right now, but the use of those function pointers do make it substantially more difficult to write a bit-for-bit identical structure across the C/Rust interface, so omit them for now. Instead of the more customary "&self", use "self", because the former is the size of a pointer and the latter is the size of an integer on most systems. Don't define an unknown value but use an Option for that instead. Update the object ID structure to allow slicing the data appropriately for the algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- src/hash.rs | 159 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 159 insertions(+) diff --git a/src/hash.rs b/src/hash.rs index 0219391820..0ec0ab0490 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -10,8 +10,25 @@ // You should have received a copy of the GNU General Public License along // with this program; if not, see <https://www.gnu.org/licenses/>. +use std::error::Error; +use std::fmt::{self, Debug, Display}; + pub const GIT_MAX_RAWSZ: usize = 32; +/// An error indicating an invalid hash algorithm. +/// +/// The contained `u32` is the same as the `algo` field in `ObjectID`. +#[derive(Debug, Copy, Clone)] +pub struct InvalidHashAlgorithm(pub u32); + +impl Display for InvalidHashAlgorithm { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + write!(f, "invalid hash algorithm {}", self.0) + } +} + +impl Error for InvalidHashAlgorithm {} + /// A binary object ID. 
#[repr(C)] #[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)] @@ -19,3 +36,145 @@ pub struct ObjectID { pub hash: [u8; GIT_MAX_RAWSZ], pub algo: u32, } + +#[allow(dead_code)] +impl ObjectID { + pub fn as_slice(&self) -> Result<&[u8], InvalidHashAlgorithm> { + match HashAlgorithm::from_u32(self.algo) { + Some(algo) => Ok(&self.hash[0..algo.raw_len()]), + None => Err(InvalidHashAlgorithm(self.algo)), + } + } + + pub fn as_mut_slice(&mut self) -> Result<&mut [u8], InvalidHashAlgorithm> { + match HashAlgorithm::from_u32(self.algo) { + Some(algo) => Ok(&mut self.hash[0..algo.raw_len()]), + None => Err(InvalidHashAlgorithm(self.algo)), + } + } +} + +/// A hash algorithm, +#[repr(C)] +#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] +pub enum HashAlgorithm { + SHA1 = 1, + SHA256 = 2, +} + +#[allow(dead_code)] +impl HashAlgorithm { + const SHA1_NULL_OID: ObjectID = ObjectID { + hash: [0u8; 32], + algo: Self::SHA1 as u32, + }; + const SHA256_NULL_OID: ObjectID = ObjectID { + hash: [0u8; 32], + algo: Self::SHA256 as u32, + }; + + const SHA1_EMPTY_TREE: ObjectID = ObjectID { + hash: *b"\x4b\x82\x5d\xc6\x42\xcb\x6e\xb9\xa0\x60\xe5\x4b\xf8\xd6\x92\x88\xfb\xee\x49\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + algo: Self::SHA1 as u32, + }; + const SHA256_EMPTY_TREE: ObjectID = ObjectID { + hash: *b"\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1\x04\xd4\x5d\x8d\x85\xef\xa9\xb0\x57\xb5\x3b\x14\xb4\xb9\xb9\x39\xdd\x74\xde\xcc\x53\x21", + algo: Self::SHA256 as u32, + }; + + const SHA1_EMPTY_BLOB: ObjectID = ObjectID { + hash: *b"\xe6\x9d\xe2\x9b\xb2\xd1\xd6\x43\x4b\x8b\x29\xae\x77\x5a\xd8\xc2\xe4\x8c\x53\x91\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + algo: Self::SHA1 as u32, + }; + const SHA256_EMPTY_BLOB: ObjectID = ObjectID { + hash: *b"\x47\x3a\x0f\x4c\x3b\xe8\xa9\x36\x81\xa2\x67\xe3\xb1\xe9\xa7\xdc\xda\x11\x85\x43\x6f\xe1\x41\xf7\x74\x91\x20\xa3\x03\x72\x18\x13", + algo: Self::SHA256 as u32, + }; + + /// Return a hash algorithm based on the internal integer ID used by Git. + /// + /// Returns `None` if the algorithm doesn't indicate a valid algorithm. + pub const fn from_u32(algo: u32) -> Option<HashAlgorithm> { + match algo { + 1 => Some(HashAlgorithm::SHA1), + 2 => Some(HashAlgorithm::SHA256), + _ => None, + } + } + + /// Return a hash algorithm based on the internal integer ID used by Git. + /// + /// Returns `None` if the algorithm doesn't indicate a valid algorithm. + pub const fn from_format_id(algo: u32) -> Option<HashAlgorithm> { + match algo { + 0x73686131 => Some(HashAlgorithm::SHA1), + 0x73323536 => Some(HashAlgorithm::SHA256), + _ => None, + } + } + + /// The name of this hash algorithm as a string suitable for the configuration file. + pub const fn name(self) -> &'static str { + match self { + HashAlgorithm::SHA1 => "sha1", + HashAlgorithm::SHA256 => "sha256", + } + } + + /// The format ID of this algorithm for binary formats. + /// + /// Note that when writing this to a data format, it should be written in big-endian format + /// explicitly. + pub const fn format_id(self) -> u32 { + match self { + HashAlgorithm::SHA1 => 0x73686131, + HashAlgorithm::SHA256 => 0x73323536, + } + } + + /// The length of binary object IDs in this algorithm in bytes. + pub const fn raw_len(self) -> usize { + match self { + HashAlgorithm::SHA1 => 20, + HashAlgorithm::SHA256 => 32, + } + } + + /// The length of object IDs in this algorithm in hexadecimal characters. 
+ pub const fn hex_len(self) -> usize { + self.raw_len() * 2 + } + + /// The number of bytes which is processed by one iteration of this algorithm's compression + /// function. + pub const fn block_size(self) -> usize { + match self { + HashAlgorithm::SHA1 => 64, + HashAlgorithm::SHA256 => 64, + } + } + + /// The object ID representing the empty blob. + pub const fn empty_blob(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_EMPTY_BLOB, + HashAlgorithm::SHA256 => &Self::SHA256_EMPTY_BLOB, + } + } + + /// The object ID representing the empty tree. + pub const fn empty_tree(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_EMPTY_TREE, + HashAlgorithm::SHA256 => &Self::SHA256_EMPTY_TREE, + } + } + + /// The object ID which is all zeros. + pub const fn null_oid(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_NULL_OID, + HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, + } + } +} ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 06/15] hash: add a function to look up hash algo structs 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (4 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 05/15] rust: add a hash algorithm abstraction brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 07/15] rust: add additional helpers for ObjectID brian m. carlson ` (8 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren In C, it's easy for us to look up a hash algorithm structure by its offset by simply indexing the hash_algos array. However, in Rust, we sometimes need a pointer to pass to a C function, but we have our own hash algorithm abstraction. To get one from the other, let's provide a simple function that looks up the C structure from the offset and expose it in Rust. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- hash.c | 7 +++++++ hash.h | 1 + src/hash.rs | 14 ++++++++++++++ 3 files changed, 22 insertions(+) diff --git a/hash.c b/hash.c index 81b4f87027..97fd473607 100644 --- a/hash.c +++ b/hash.c @@ -241,6 +241,13 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop) return oid_to_hex_r(buf, algop->empty_tree); } +const struct git_hash_algo *hash_algo_ptr_by_number(uint32_t algo) +{ + if (algo >= GIT_HASH_NALGOS) + return NULL; + return &hash_algos[algo]; +} + uint32_t hash_algo_by_name(const char *name) { if (!name) diff --git a/hash.h b/hash.h index 99c9c2a0a8..709d7585a5 100644 --- a/hash.h +++ b/hash.h @@ -340,6 +340,7 @@ static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx ctx->algop->final_oid_fn(oid, ctx); } +const struct git_hash_algo *hash_algo_ptr_by_number(uint32_t algo); /* * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if * the name doesn't match a known algorithm. diff --git a/src/hash.rs b/src/hash.rs index 0ec0ab0490..70bb8095e8 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -12,6 +12,7 @@ use std::error::Error; use std::fmt::{self, Debug, Display}; +use std::os::raw::c_void; pub const GIT_MAX_RAWSZ: usize = 32; @@ -177,4 +178,17 @@ impl HashAlgorithm { HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, } } + + /// A pointer to the C `struct git_hash_algo` for interoperability with C. + pub fn hash_algo_ptr(self) -> *const c_void { + unsafe { c::hash_algo_ptr_by_number(self as u32) } + } +} + +pub mod c { + use std::os::raw::c_void; + + extern "C" { + pub fn hash_algo_ptr_by_number(n: u32) -> *const c_void; + } } ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 07/15] rust: add additional helpers for ObjectID 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (5 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 06/15] hash: add a function to look up hash algo structs brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 08/15] csum-file: define hashwrite's count as a uint32_t brian m. carlson ` (7 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren Right now, users can internally access the contents of the ObjectID struct, which can lead to data that is not valid, such as invalid algorithms or non-zero-padded hash values. These can cause problems down the line as we use them more. Add a constructor for ObjectID that allows us to set these values and also provide an accessor for the algorithm so that we can access it. In addition, provide useful Display and Debug implementations that can format our data in a useful way. Now that we have the ability to work with these various components in a nice way, add some tests as well to make sure that ObjectID and HashAlgorithm work together as expected. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- src/hash.rs | 133 +++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 132 insertions(+), 1 deletion(-) diff --git a/src/hash.rs b/src/hash.rs index 70bb8095e8..e1fa568661 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -32,7 +32,7 @@ impl Error for InvalidHashAlgorithm {} /// A binary object ID. #[repr(C)] -#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)] +#[derive(Clone, Ord, PartialOrd, Eq, PartialEq)] pub struct ObjectID { pub hash: [u8; GIT_MAX_RAWSZ], pub algo: u32, @@ -40,6 +40,27 @@ pub struct ObjectID { #[allow(dead_code)] impl ObjectID { + /// Return a new object ID with the given algorithm and hash. + /// + /// `hash` must be exactly the proper length for `algo` and this function panics if it is not. + /// The extra internal storage of `hash`, if any, is zero filled. + pub fn new(algo: HashAlgorithm, hash: &[u8]) -> Self { + let mut data = [0u8; GIT_MAX_RAWSZ]; + // This verifies that the length of `hash` is correct. + data[0..algo.raw_len()].copy_from_slice(hash); + Self { + hash: data, + algo: algo as u32, + } + } + + /// Return the algorithm for this object ID. + /// + /// If the algorithm set internally is not valid, this function panics. + pub fn algo(&self) -> Result<HashAlgorithm, InvalidHashAlgorithm> { + HashAlgorithm::from_u32(self.algo).ok_or(InvalidHashAlgorithm(self.algo)) + } + pub fn as_slice(&self) -> Result<&[u8], InvalidHashAlgorithm> { match HashAlgorithm::from_u32(self.algo) { Some(algo) => Ok(&self.hash[0..algo.raw_len()]), @@ -55,6 +76,41 @@ impl ObjectID { } } +impl Display for ObjectID { + /// Format this object ID as a hex object ID. + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + let hash = self.as_slice().unwrap(); + for x in hash { + write!(f, "{:02x}", x)?; + } + Ok(()) + } +} + +impl Debug for ObjectID { + /// Format this object ID as a hex object ID with a colon and name appended to it. 
+ /// + /// ``` + /// assert_eq!( + /// format!("{:?}", HashAlgorithm::SHA256.null_oid()), + /// "0000000000000000000000000000000000000000000000000000000000000000:sha256" + /// ); + /// ``` + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + let hash = match self.as_slice() { + Ok(hash) => hash, + Err(_) => &self.hash, + }; + for x in hash { + write!(f, "{:02x}", x)?; + } + match self.algo() { + Ok(algo) => write!(f, ":{}", algo.name()), + Err(e) => write!(f, ":invalid-hash-algo-{}", e.0), + } + } +} + /// A hash algorithm, #[repr(C)] #[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] @@ -192,3 +248,78 @@ pub mod c { pub fn hash_algo_ptr_by_number(n: u32) -> *const c_void; } } + +#[cfg(test)] +mod tests { + use super::HashAlgorithm; + + fn all_algos() -> &'static [HashAlgorithm] { + &[HashAlgorithm::SHA1, HashAlgorithm::SHA256] + } + + #[test] + fn format_id_round_trips() { + for algo in all_algos() { + assert_eq!( + *algo, + HashAlgorithm::from_format_id(algo.format_id()).unwrap() + ); + } + } + + #[test] + fn offset_round_trips() { + for algo in all_algos() { + assert_eq!(*algo, HashAlgorithm::from_u32(*algo as u32).unwrap()); + } + } + + #[test] + fn slices_have_correct_length() { + for algo in all_algos() { + for oid in [algo.null_oid(), algo.empty_blob(), algo.empty_tree()] { + assert_eq!(oid.as_slice().unwrap().len(), algo.raw_len()); + } + } + } + + #[test] + fn object_ids_format_correctly() { + let entries = &[ + ( + HashAlgorithm::SHA1.null_oid(), + "0000000000000000000000000000000000000000", + "0000000000000000000000000000000000000000:sha1", + ), + ( + HashAlgorithm::SHA1.empty_blob(), + "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", + "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391:sha1", + ), + ( + HashAlgorithm::SHA1.empty_tree(), + "4b825dc642cb6eb9a060e54bf8d69288fbee4904", + "4b825dc642cb6eb9a060e54bf8d69288fbee4904:sha1", + ), + ( + HashAlgorithm::SHA256.null_oid(), + "0000000000000000000000000000000000000000000000000000000000000000", + "0000000000000000000000000000000000000000000000000000000000000000:sha256", + ), + ( + HashAlgorithm::SHA256.empty_blob(), + "473a0f4c3be8a93681a267e3b1e9a7dcda1185436fe141f7749120a303721813", + "473a0f4c3be8a93681a267e3b1e9a7dcda1185436fe141f7749120a303721813:sha256", + ), + ( + HashAlgorithm::SHA256.empty_tree(), + "6ef19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321", + "6ef19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321:sha256", + ), + ]; + for (oid, display, debug) in entries { + assert_eq!(format!("{}", oid), *display); + assert_eq!(format!("{:?}", oid), *debug); + } + } +} ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 08/15] csum-file: define hashwrite's count as a uint32_t 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (6 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 07/15] rust: add additional helpers for ObjectID brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 09/15] write-or-die: add an fsync component for the object map brian m. carlson ` (6 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We want to call this code from Rust and ensure that the types are the same for compatibility, which is easiest to do if the type is a fixed size. Since unsigned int is 32 bits on all the platforms we care about, define it as a uint32_t instead. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- csum-file.c | 2 +- csum-file.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/csum-file.c b/csum-file.c index 6e21e3cac8..3d3047c776 100644 --- a/csum-file.c +++ b/csum-file.c @@ -110,7 +110,7 @@ void discard_hashfile(struct hashfile *f) free_hashfile(f); } -void hashwrite(struct hashfile *f, const void *buf, unsigned int count) +void hashwrite(struct hashfile *f, const void *buf, uint32_t count) { while (count) { unsigned left = f->buffer_len - f->offset; diff --git a/csum-file.h b/csum-file.h index 07ae11024a..ecce9d27b0 100644 --- a/csum-file.h +++ b/csum-file.h @@ -63,7 +63,7 @@ void free_hashfile(struct hashfile *f); */ int finalize_hashfile(struct hashfile *, unsigned char *, enum fsync_component, unsigned int); void discard_hashfile(struct hashfile *); -void hashwrite(struct hashfile *, const void *, unsigned int); +void hashwrite(struct hashfile *, const void *, uint32_t); void hashflush(struct hashfile *f); void crc32_begin(struct hashfile *); uint32_t crc32_end(struct hashfile *); ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 09/15] write-or-die: add an fsync component for the object map 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (7 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 08/15] csum-file: define hashwrite's count as a uint32_t brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 10/15] hash: expose hash context functions to Rust brian m. carlson ` (5 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We'll soon be writing out an object map using the hashfile code. Add an fsync component to allow us to handle fsyncing it correctly. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- write-or-die.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/write-or-die.h b/write-or-die.h index 65a5c42a47..ff0408bd84 100644 --- a/write-or-die.h +++ b/write-or-die.h @@ -21,6 +21,7 @@ enum fsync_component { FSYNC_COMPONENT_COMMIT_GRAPH = 1 << 3, FSYNC_COMPONENT_INDEX = 1 << 4, FSYNC_COMPONENT_REFERENCE = 1 << 5, + FSYNC_COMPONENT_OBJECT_MAP = 1 << 6, }; #define FSYNC_COMPONENTS_OBJECTS (FSYNC_COMPONENT_LOOSE_OBJECT | \ @@ -44,7 +45,8 @@ enum fsync_component { FSYNC_COMPONENT_PACK_METADATA | \ FSYNC_COMPONENT_COMMIT_GRAPH | \ FSYNC_COMPONENT_INDEX | \ - FSYNC_COMPONENT_REFERENCE) + FSYNC_COMPONENT_REFERENCE | \ + FSYNC_COMPONENT_OBJECT_MAP) #ifndef FSYNC_COMPONENTS_PLATFORM_DEFAULT #define FSYNC_COMPONENTS_PLATFORM_DEFAULT FSYNC_COMPONENTS_DEFAULT ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 10/15] hash: expose hash context functions to Rust 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (8 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 09/15] write-or-die: add an fsync component for the object map brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 11/15] rust: add a build.rs script for tests brian m. carlson ` (4 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We'd like to be able to hash our data in Rust using the same contexts as in C. However, we need our helper functions to not be inline so they can be linked into the binary appropriately. In addition, to avoid managing memory manually and since we don't know the size of the hash context structure, we want to have simple alloc and free functions we can use to make sure a context can be easily dynamically created. Expose the helper functions and create alloc, free, and init functions we can call. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- hash.c | 35 +++++++++++++++++++++++++++++++++++ hash.h | 27 +++++++-------------------- 2 files changed, 42 insertions(+), 20 deletions(-) diff --git a/hash.c b/hash.c index 97fd473607..553f2008ea 100644 --- a/hash.c +++ b/hash.c @@ -248,6 +248,41 @@ const struct git_hash_algo *hash_algo_ptr_by_number(uint32_t algo) return &hash_algos[algo]; } +struct git_hash_ctx *git_hash_alloc(void) +{ + return xmalloc(sizeof(struct git_hash_ctx)); +} + +void git_hash_free(struct git_hash_ctx *ctx) +{ + free(ctx); +} + +void git_hash_init(struct git_hash_ctx *ctx, const struct git_hash_algo *algop) +{ + algop->init_fn(ctx); +} + +void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src) +{ + src->algop->clone_fn(dst, src); +} + +void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len) +{ + ctx->algop->update_fn(ctx, in, len); +} + +void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx) +{ + ctx->algop->final_fn(hash, ctx); +} + +void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx) +{ + ctx->algop->final_oid_fn(oid, ctx); +} + uint32_t hash_algo_by_name(const char *name) { if (!name) diff --git a/hash.h b/hash.h index 709d7585a5..d51efce1d3 100644 --- a/hash.h +++ b/hash.h @@ -320,27 +320,14 @@ struct git_hash_algo { }; extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS]; -static inline void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src) -{ - src->algop->clone_fn(dst, src); -} - -static inline void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len) -{ - ctx->algop->update_fn(ctx, in, len); -} - -static inline void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx) -{ - ctx->algop->final_fn(hash, ctx); -} - -static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx) -{ - ctx->algop->final_oid_fn(oid, ctx); -} - +void git_hash_init(struct git_hash_ctx *ctx, const struct git_hash_algo *algop); +void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src); +void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len); +void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx); +void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx); const struct git_hash_algo *hash_algo_ptr_by_number(uint32_t algo); +struct git_hash_ctx *git_hash_alloc(void); +void 
git_hash_free(struct git_hash_ctx *ctx); /* * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if * the name doesn't match a known algorithm. ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 11/15] rust: add a build.rs script for tests 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (9 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 10/15] hash: expose hash context functions to Rust brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 12/15] rust: add functionality to hash an object brian m. carlson ` (3 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren Cargo uses the build.rs script to determine how to compile and link a binary. The only binary we're generating, however, is for our tests, but in a future commit, we're going to link against libgit.a for some functionality and we'll need to make sure the test binaries are complete. Add a build.rs file for this case and specify the files we're going to be linking against. Because we cannot specify different dependencies when building our static library versus our tests, update the Makefile to specify these dependencies for our static library to avoid race conditions during build. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Makefile | 2 +- build.rs | 17 +++++++++++++++++ 2 files changed, 18 insertions(+), 1 deletion(-) create mode 100644 build.rs diff --git a/Makefile b/Makefile index e1d0ae3691..4211d7622a 100644 --- a/Makefile +++ b/Makefile @@ -2964,7 +2964,7 @@ scalar$X: scalar.o GIT-LDFLAGS $(GITLIBS) $(LIB_FILE): $(LIB_OBJS) $(QUIET_AR)$(RM) $@ && $(AR) $(ARFLAGS) $@ $^ -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(LIB_FILE) $(QUIET_CARGO)cargo build $(CARGO_ARGS) .PHONY: rust diff --git a/build.rs b/build.rs new file mode 100644 index 0000000000..3724b3a930 --- /dev/null +++ b/build.rs @@ -0,0 +1,17 @@ +// This program is free software; you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation: version 2 of the License, dated June 1991. +// +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License along +// with this program; if not, see <https://www.gnu.org/licenses/>. + +fn main() { + println!("cargo:rustc-link-search=."); + println!("cargo:rustc-link-lib=git"); + println!("cargo:rustc-link-lib=z"); +} ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 12/15] rust: add functionality to hash an object 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (10 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 11/15] rust: add a build.rs script for tests brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 13/15] rust: add a new binary object map format brian m. carlson ` (2 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren In a future commit, we'll want to hash some data when dealing with an object map. Let's make this easy by creating a structure to hash objects and calling into the C functions as necessary to perform the hashing. For now, we only implement safe hashing, but in the future we could add unsafe hashing if we want. Implement Clone and Drop to appropriately manage our memory. Additionally implement Write to make it easy to use with other formats that implement this trait. While we're at it, add some tests for the various hashing cases. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- src/hash.rs | 143 +++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 142 insertions(+), 1 deletion(-) diff --git a/src/hash.rs b/src/hash.rs index e1fa568661..dea2998de4 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -12,6 +12,7 @@ use std::error::Error; use std::fmt::{self, Debug, Display}; +use std::io::{self, Write}; use std::os::raw::c_void; pub const GIT_MAX_RAWSZ: usize = 32; @@ -111,6 +112,100 @@ impl Debug for ObjectID { } } +/// A trait to implement hashing with a cryptographic algorithm. +pub trait CryptoDigest { + /// Return true if this digest is safe for use with untrusted data, false otherwise. + fn is_safe(&self) -> bool; + + /// Update the digest with the specified data. + fn update(&mut self, data: &[u8]); + + /// Return an object ID, consuming the hasher. + fn into_oid(self) -> ObjectID; + + /// Return a hash as a `Vec`, consuming the hasher. + fn into_vec(self) -> Vec<u8>; +} + +/// A structure to hash data with a cryptographic hash algorithm. +/// +/// Instances of this class are safe for use with untrusted data, provided Git has been compiled +/// with a collision-detecting implementation of SHA-1. +pub struct CryptoHasher { + algo: HashAlgorithm, + ctx: *mut c_void, +} + +impl CryptoHasher { + /// Create a new hasher with the algorithm specified with `algo`. + /// + /// This hasher is safe to use on untrusted data. If SHA-1 is selected and Git was compiled + /// with a collision-detecting implementation of SHA-1, then this function will use that + /// implementation and detect any attempts at a collision. + pub fn new(algo: HashAlgorithm) -> Self { + let ctx = unsafe { c::git_hash_alloc() }; + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; + Self { algo, ctx } + } +} + +impl CryptoDigest for CryptoHasher { + /// Return true if this digest is safe for use with untrusted data, false otherwise. + fn is_safe(&self) -> bool { + true + } + + /// Update the hasher with the specified data. + fn update(&mut self, data: &[u8]) { + unsafe { c::git_hash_update(self.ctx, data.as_ptr() as *const c_void, data.len()) }; + } + + /// Return an object ID, consuming the hasher. 
+ fn into_oid(self) -> ObjectID { + let mut oid = ObjectID { + hash: [0u8; 32], + algo: self.algo as u32, + }; + unsafe { c::git_hash_final_oid(&mut oid as *mut ObjectID as *mut c_void, self.ctx) }; + oid + } + + /// Return a hash as a `Vec`, consuming the hasher. + fn into_vec(self) -> Vec<u8> { + let mut v = vec![0u8; self.algo.raw_len()]; + unsafe { c::git_hash_final(v.as_mut_ptr(), self.ctx) }; + v + } +} + +impl Clone for CryptoHasher { + fn clone(&self) -> Self { + let ctx = unsafe { c::git_hash_alloc() }; + unsafe { c::git_hash_clone(ctx, self.ctx) }; + Self { + algo: self.algo, + ctx, + } + } +} + +impl Drop for CryptoHasher { + fn drop(&mut self) { + unsafe { c::git_hash_free(self.ctx) }; + } +} + +impl Write for CryptoHasher { + fn write(&mut self, data: &[u8]) -> io::Result<usize> { + self.update(data); + Ok(data.len()) + } + + fn flush(&mut self) -> io::Result<()> { + Ok(()) + } +} + /// A hash algorithm, #[repr(C)] #[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] @@ -239,6 +334,11 @@ impl HashAlgorithm { pub fn hash_algo_ptr(self) -> *const c_void { unsafe { c::hash_algo_ptr_by_number(self as u32) } } + + /// Create a hasher for this algorithm. + pub fn hasher(self) -> CryptoHasher { + CryptoHasher::new(self) + } } pub mod c { @@ -246,12 +346,21 @@ pub mod c { extern "C" { pub fn hash_algo_ptr_by_number(n: u32) -> *const c_void; + pub fn unsafe_hash_algo(algop: *const c_void) -> *const c_void; + pub fn git_hash_alloc() -> *mut c_void; + pub fn git_hash_free(ctx: *mut c_void); + pub fn git_hash_init(dst: *mut c_void, algop: *const c_void); + pub fn git_hash_clone(dst: *mut c_void, src: *const c_void); + pub fn git_hash_update(ctx: *mut c_void, inp: *const c_void, len: usize); + pub fn git_hash_final(hash: *mut u8, ctx: *mut c_void); + pub fn git_hash_final_oid(hash: *mut c_void, ctx: *mut c_void); } } #[cfg(test)] mod tests { - use super::HashAlgorithm; + use super::{CryptoDigest, HashAlgorithm, ObjectID}; + use std::io::Write; fn all_algos() -> &'static [HashAlgorithm] { &[HashAlgorithm::SHA1, HashAlgorithm::SHA256] @@ -322,4 +431,36 @@ mod tests { assert_eq!(format!("{:?}", oid), *debug); } } + + #[test] + fn hasher_works_correctly() { + for algo in all_algos() { + let tests: &[(&[u8], &ObjectID)] = &[ + (b"blob 0\0", algo.empty_blob()), + (b"tree 0\0", algo.empty_tree()), + ]; + for (data, oid) in tests { + let mut h = algo.hasher(); + assert!(h.is_safe()); + // Test that this works incrementally. + h.update(&data[0..2]); + h.update(&data[2..]); + + let h2 = h.clone(); + + let actual_oid = h.into_oid(); + assert_eq!(**oid, actual_oid); + + let v = h2.into_vec(); + assert_eq!((*oid).as_slice().unwrap(), &v); + + let mut h = algo.hasher(); + h.write_all(&data[0..2]).unwrap(); + h.write_all(&data[2..]).unwrap(); + + let actual_oid = h.into_oid(); + assert_eq!(**oid, actual_oid); + } + } + } } ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 13/15] rust: add a new binary object map format 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (11 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 12/15] rust: add functionality to hash an object brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 14/15] rust: add a small wrapper around the hashfile code brian m. carlson 2025-11-17 22:16 ` [PATCH v2 15/15] object-file-convert: always make sure object ID algo is valid brian m. carlson 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren Our current loose object format has a few problems. First, it is not efficient: the list of object IDs is not sorted and even if it were, there would not be an efficient way to look up objects in both algorithms. Second, we need to store mappings for things which are not technically loose objects but are not packed objects, either, and so cannot be stored in a pack index. These kinds of things include shallows, their parents, and their trees, as well as submodules. Yet we also need to implement a sensible way to store the kind of object so that we can prune unneeded entries. For instance, if the user has updated the shallows, we can remove the old values. For these reasons, introduce a new binary object map format. The careful reader will notice that it resembles very closely the pack index v3 format. Add an in-memory object map as well, and allow writing to a batched map, which can then be written later as one of the binary object maps. Include several tests for round tripping and data lookup across algorithms. Note that the use of this code elsewhere in Git will involve some C code and some C-compatible code in Rust that will be introduced in a future commit. Thus, for example, we ignore the fact that if there is no current batch and the caller asks for data to be written, this code does nothing, mostly because this code also does not involve itself with opening or manipulating files. The C code that we will add later will implement this functionality at a higher level and take care of this, since the code which is necessary for writing to the object store is deeply involved with our C abstractions and it would require extensive work (which would not be especially valuable at this point) to port those to Rust. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Documentation/gitformat-loose.adoc | 78 +++ Makefile | 1 + src/lib.rs | 1 + src/loose.rs | 913 +++++++++++++++++++++++++++++ src/meson.build | 1 + 5 files changed, 994 insertions(+) create mode 100644 src/loose.rs diff --git a/Documentation/gitformat-loose.adoc b/Documentation/gitformat-loose.adoc index 947993663e..b0b569761b 100644 --- a/Documentation/gitformat-loose.adoc +++ b/Documentation/gitformat-loose.adoc @@ -10,6 +10,7 @@ SYNOPSIS -------- [verse] $GIT_DIR/objects/[0-9a-f][0-9a-f]/* +$GIT_DIR/objects/object-map/map-*.map DESCRIPTION ----------- @@ -48,6 +49,83 @@ stored under Similarly, a blob containing the contents `abc` would have the uncompressed data of `blob 3\0abc`. +== Loose object mapping + +When the `compatObjectFormat` option is used, Git needs to store a mapping +between the repository's main algorithm and the compatibility algorithm for +loose objects as well as some auxiliary information. + +The mapping consists of a set of files under `$GIT_DIR/objects/object-map` +ending in `.map`. 
The portion of the filename before the extension is that of +the main hash checksum (that is, the one specified in +`extensions.objectformat`) in hex format. + +`git gc` will repack existing entries into one file, removing any unnecessary +objects, such as obsolete shallow entries or loose objects that have been +packed. + +The file format is as follows. All values are in network byte order and all +4-byte and 8-byte values must be 4-byte aligned in the file, so the NUL padding +may be required in some cases. Git always uses the smallest number of NUL +bytes (including zero) that is required for the padding in order to make +writing files deterministic. + +- A header appears at the beginning and consists of the following: + * A 4-byte mapping signature: `LMAP` + * 4-byte version number: 1 + * 4-byte length of the header section (including reserved entries but + excluding any NUL padding). + * 4-byte number of objects declared in this map file. + * 4-byte number of object formats declared in this map file. + * For each object format: + ** 4-byte format identifier (e.g., `sha1` for SHA-1) + ** 4-byte length in bytes of shortened object names (that is, prefixes of + the full object names). This is the shortest possible length needed to + make names in the shortened object name table unambiguous. + ** 8-byte integer, recording where tables relating to this format + are stored in this index file, as an offset from the beginning. + * 8-byte offset to the trailer from the beginning of this file. + * The remainder of the header section is reserved for future use. + Readers must ignore unrecognized data here. +- Zero or more NUL bytes. These are used to improve the alignment of the + 4-byte quantities below. +- Tables for the first object format: + * A sorted table of shortened object names. These are prefixes of the names + of all objects in this file, packed together to reduce the cache footprint + of the binary search for a specific object name. + * A sorted table of full object names. + * A table of 4-byte metadata values. +- Zero or more NUL bytes. +- Tables for subsequent object formats: + * A sorted table of shortened object names. These are prefixes of the names + of all objects in this file, packed together without offset values to + reduce the cache footprint of the binary search for a specific object name. + * A table of full object names in the order specified by the first object format. + * A table of 4-byte values mapping object name order to the order of the + first object format. For an object in the table of sorted shortened object + names, the value at the corresponding index in this table is the index in + the previous table for that same object. + * Zero or more NUL bytes. +- The trailer consists of the following: + * Hash checksum of all of the above using the main hash. + +The lower six bits of each metadata table contain a type field indicating the +reason that this object is stored: + +0:: + Reserved. +1:: + This object is stored as a loose object in the repository. +2:: + This object is a shallow entry. The mapping refers to a shallow value + returned by a remote server. +3:: + This object is a submodule entry. The mapping refers to the commit stored + representing a submodule. + +Other data may be stored in this field in the future. Bits that are not used +must be zero. 
+ GIT --- Part of the linkgit:git[1] suite diff --git a/Makefile b/Makefile index 4211d7622a..40785c14fd 100644 --- a/Makefile +++ b/Makefile @@ -1536,6 +1536,7 @@ UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o RUST_SOURCES += src/hash.rs RUST_SOURCES += src/lib.rs +RUST_SOURCES += src/loose.rs RUST_SOURCES += src/varint.rs GIT-VERSION-FILE: FORCE diff --git a/src/lib.rs b/src/lib.rs index cf7c962509..442f9433dc 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1,2 +1,3 @@ pub mod hash; +pub mod loose; pub mod varint; diff --git a/src/loose.rs b/src/loose.rs new file mode 100644 index 0000000000..24accf9c33 --- /dev/null +++ b/src/loose.rs @@ -0,0 +1,913 @@ +// This program is free software; you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation: version 2 of the License, dated June 1991. +// +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License along +// with this program; if not, see <https://www.gnu.org/licenses/>. + +use crate::hash::{HashAlgorithm, ObjectID, GIT_MAX_RAWSZ}; +use std::collections::BTreeMap; +use std::convert::TryInto; +use std::io::{self, Write}; + +/// The type of object stored in the map. +/// +/// If this value is `Reserved`, then it is never written to disk and is used primarily to store +/// certain hard-coded objects, like the empty tree, empty blob, or null object ID. +/// +/// If this value is `LooseObject`, then this represents a loose object. `Shallow` represents a +/// shallow commit, its parent, or its tree. `Submodule` represents a submodule commit. +#[repr(C)] +#[derive(Debug, Clone, Copy, Ord, PartialOrd, Eq, PartialEq)] +pub enum MapType { + Reserved = 0, + LooseObject = 1, + Shallow = 2, + Submodule = 3, +} + +impl MapType { + pub fn from_u32(n: u32) -> Option<MapType> { + match n { + 0 => Some(Self::Reserved), + 1 => Some(Self::LooseObject), + 2 => Some(Self::Shallow), + 3 => Some(Self::Submodule), + _ => None, + } + } +} + +/// The value of an object stored in a `ObjectMemoryMap`. +/// +/// This keeps the object ID to which the key is mapped and its kind together. +struct MappedObject { + oid: ObjectID, + kind: MapType, +} + +/// Memory storage for a loose object. +struct ObjectMemoryMap { + to_compat: BTreeMap<ObjectID, MappedObject>, + to_storage: BTreeMap<ObjectID, MappedObject>, + compat: HashAlgorithm, + storage: HashAlgorithm, +} + +impl ObjectMemoryMap { + /// Create a new `ObjectMemoryMap`. + /// + /// The storage and compatibility `HashAlgorithm` instances are used to store the object IDs in + /// the correct map. + fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> Self { + Self { + to_compat: BTreeMap::new(), + to_storage: BTreeMap::new(), + compat, + storage, + } + } + + fn len(&self) -> usize { + self.to_compat.len() + } + + /// Write this map to an interface implementing `std::io::Write`. 
+ fn write<W: Write>(&self, wrtr: W) -> io::Result<()> { + const VERSION_NUMBER: u32 = 1; + const NUM_OBJECT_FORMATS: u32 = 2; + const PADDING: [u8; 4] = [0u8; 4]; + + let mut wrtr = wrtr; + let header_size: u32 = (4 * 5) + (4 + 4 + 8) * NUM_OBJECT_FORMATS + 8; + + wrtr.write_all(b"LMAP")?; + wrtr.write_all(&VERSION_NUMBER.to_be_bytes())?; + wrtr.write_all(&header_size.to_be_bytes())?; + wrtr.write_all(&(self.to_compat.len() as u32).to_be_bytes())?; + wrtr.write_all(&NUM_OBJECT_FORMATS.to_be_bytes())?; + + let storage_short_len = self.find_short_name_len(&self.to_compat, self.storage); + let compat_short_len = self.find_short_name_len(&self.to_storage, self.compat); + + let storage_npadding = Self::required_nul_padding(self.to_compat.len(), storage_short_len); + let compat_npadding = Self::required_nul_padding(self.to_compat.len(), compat_short_len); + + let mut offset: u64 = header_size as u64; + + for (algo, len, npadding) in &[ + (self.storage, storage_short_len, storage_npadding), + (self.compat, compat_short_len, compat_npadding), + ] { + wrtr.write_all(&algo.format_id().to_be_bytes())?; + wrtr.write_all(&(*len as u32).to_be_bytes())?; + + offset += *npadding; + wrtr.write_all(&offset.to_be_bytes())?; + + offset += self.to_compat.len() as u64 * (*len as u64 + algo.raw_len() as u64 + 4); + } + + wrtr.write_all(&offset.to_be_bytes())?; + + let order_map: BTreeMap<&ObjectID, usize> = self + .to_compat + .keys() + .enumerate() + .map(|(i, oid)| (oid, i)) + .collect(); + + wrtr.write_all(&PADDING[0..storage_npadding as usize])?; + for oid in self.to_compat.keys() { + wrtr.write_all(&oid.as_slice().unwrap()[0..storage_short_len])?; + } + for oid in self.to_compat.keys() { + wrtr.write_all(oid.as_slice().unwrap())?; + } + for meta in self.to_compat.values() { + wrtr.write_all(&(meta.kind as u32).to_be_bytes())?; + } + + wrtr.write_all(&PADDING[0..compat_npadding as usize])?; + for oid in self.to_storage.keys() { + wrtr.write_all(&oid.as_slice().unwrap()[0..compat_short_len])?; + } + for meta in self.to_compat.values() { + wrtr.write_all(meta.oid.as_slice().unwrap())?; + } + for meta in self.to_storage.values() { + wrtr.write_all(&(order_map[&meta.oid] as u32).to_be_bytes())?; + } + + Ok(()) + } + + fn required_nul_padding(nitems: usize, short_len: usize) -> u64 { + let shortened_table_len = nitems as u64 * short_len as u64; + let misalignment = shortened_table_len & 3; + // If the value is 0, return 0; otherwise, return the difference from 4. 
+ (4 - misalignment) & 3 + } + + fn last_matching_offset(a: &ObjectID, b: &ObjectID, algop: HashAlgorithm) -> usize { + for i in 0..=algop.raw_len() { + if a.hash[i] != b.hash[i] { + return i; + } + } + algop.raw_len() + } + + fn find_short_name_len( + &self, + map: &BTreeMap<ObjectID, MappedObject>, + algop: HashAlgorithm, + ) -> usize { + if map.len() <= 1 { + return 1; + } + let mut len = 1; + let mut iter = map.keys(); + let mut cur = match iter.next() { + Some(cur) => cur, + None => return len, + }; + for item in iter { + let offset = Self::last_matching_offset(cur, item, algop); + if offset >= len { + len = offset + 1; + } + cur = item; + } + if len > algop.raw_len() { + algop.raw_len() + } else { + len + } + } +} + +struct ObjectFormatData { + data_off: usize, + shortened_len: usize, + full_off: usize, + mapping_off: Option<usize>, +} + +pub struct MmapedObjectMapIter<'a> { + offset: usize, + algos: Vec<HashAlgorithm>, + source: &'a MmapedObjectMap<'a>, +} + +impl<'a> Iterator for MmapedObjectMapIter<'a> { + type Item = Vec<ObjectID>; + + fn next(&mut self) -> Option<Self::Item> { + if self.offset >= self.source.nitems { + return None; + } + let offset = self.offset; + self.offset += 1; + let v: Vec<ObjectID> = self + .algos + .iter() + .cloned() + .filter_map(|algo| self.source.oid_from_offset(offset, algo)) + .collect(); + if v.len() != self.algos.len() { + return None; + } + Some(v) + } +} + +#[allow(dead_code)] +pub struct MmapedObjectMap<'a> { + memory: &'a [u8], + nitems: usize, + meta_off: usize, + obj_formats: BTreeMap<HashAlgorithm, ObjectFormatData>, + main_algo: HashAlgorithm, +} + +#[derive(Debug)] +#[allow(dead_code)] +enum MmapedParseError { + HeaderTooSmall, + InvalidSignature, + InvalidVersion, + UnknownAlgorithm, + OffsetTooLarge, + TooFewObjectFormats, + UnalignedData, + InvalidTrailerOffset, +} + +#[allow(dead_code)] +impl<'a> MmapedObjectMap<'a> { + fn new( + slice: &'a [u8], + hash_algo: HashAlgorithm, + ) -> Result<MmapedObjectMap<'a>, MmapedParseError> { + let object_format_header_size = 4 + 4 + 8; + let trailer_offset_size = 8; + let header_size: usize = + 4 + 4 + 4 + 4 + 4 + object_format_header_size * 2 + trailer_offset_size; + if slice.len() < header_size { + return Err(MmapedParseError::HeaderTooSmall); + } + if slice[0..4] != *b"LMAP" { + return Err(MmapedParseError::InvalidSignature); + } + if Self::u32_at_offset(slice, 4) != 1 { + return Err(MmapedParseError::InvalidVersion); + } + let _ = Self::u32_at_offset(slice, 8) as usize; + let nitems = Self::u32_at_offset(slice, 12) as usize; + let nobj_formats = Self::u32_at_offset(slice, 16) as usize; + if nobj_formats < 2 { + return Err(MmapedParseError::TooFewObjectFormats); + } + let mut offset = 20; + let mut meta_off = None; + let mut data = BTreeMap::new(); + for i in 0..nobj_formats { + if offset + object_format_header_size + trailer_offset_size > slice.len() { + return Err(MmapedParseError::HeaderTooSmall); + } + let format_id = Self::u32_at_offset(slice, offset); + let shortened_len = Self::u32_at_offset(slice, offset + 4) as usize; + let data_off = Self::u64_at_offset(slice, offset + 8); + + let algo = HashAlgorithm::from_format_id(format_id) + .ok_or(MmapedParseError::UnknownAlgorithm)?; + let data_off: usize = data_off + .try_into() + .map_err(|_| MmapedParseError::OffsetTooLarge)?; + + // Every object format must have these entries. 
+            let shortened_table_len = shortened_len
+                .checked_mul(nitems)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            let full_off = data_off
+                .checked_add(shortened_table_len)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            Self::verify_aligned(full_off)?;
+            Self::verify_valid(slice, full_off as u64)?;
+
+            let full_length = algo
+                .raw_len()
+                .checked_mul(nitems)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            let off = full_length
+                .checked_add(full_off)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            Self::verify_aligned(off)?;
+            Self::verify_valid(slice, off as u64)?;
+
+            // This is for the metadata for the first object format and for the order mapping for
+            // other object formats.
+            let meta_size = nitems
+                .checked_mul(4)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            let meta_end = off
+                .checked_add(meta_size)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            Self::verify_valid(slice, meta_end as u64)?;
+
+            let mut mapping_off = None;
+            if i == 0 {
+                meta_off = Some(off);
+            } else {
+                mapping_off = Some(off);
+            }
+
+            data.insert(
+                algo,
+                ObjectFormatData {
+                    data_off,
+                    shortened_len,
+                    full_off,
+                    mapping_off,
+                },
+            );
+            offset += object_format_header_size;
+        }
+        let trailer = Self::u64_at_offset(slice, offset);
+        Self::verify_aligned(trailer as usize)?;
+        Self::verify_valid(slice, trailer)?;
+        let end = trailer
+            .checked_add(hash_algo.raw_len() as u64)
+            .ok_or(MmapedParseError::OffsetTooLarge)?;
+        if end != slice.len() as u64 {
+            return Err(MmapedParseError::InvalidTrailerOffset);
+        }
+        match meta_off {
+            Some(meta_off) => Ok(MmapedObjectMap {
+                memory: slice,
+                nitems,
+                meta_off,
+                obj_formats: data,
+                main_algo: hash_algo,
+            }),
+            None => Err(MmapedParseError::TooFewObjectFormats),
+        }
+    }
+
+    fn iter(&self) -> MmapedObjectMapIter<'_> {
+        let mut algos = Vec::with_capacity(self.obj_formats.len());
+        algos.push(self.main_algo);
+        for algo in self.obj_formats.keys().cloned() {
+            if algo != self.main_algo {
+                algos.push(algo);
+            }
+        }
+        MmapedObjectMapIter {
+            offset: 0,
+            algos,
+            source: self,
+        }
+    }
+
+    /// Treats `sl` as if it were a set of slices of `wanted.len()` bytes, and searches for
+    /// `wanted` within it.
+    ///
+    /// If found, returns the offset of the subslice in `sl`.
+    ///
+    /// ```
+    /// let sl: &[u8] = &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
+    ///
+    /// assert_eq!(MmapedObjectMap::binary_search_slice(sl, &[2, 3]), Some(1));
+    /// assert_eq!(MmapedObjectMap::binary_search_slice(sl, &[6, 7]), Some(3));
+    /// assert_eq!(MmapedObjectMap::binary_search_slice(sl, &[1, 2]), None);
+    /// assert_eq!(MmapedObjectMap::binary_search_slice(sl, &[10, 20]), None);
+    /// ```
+    fn binary_search_slice(sl: &[u8], wanted: &[u8]) -> Option<usize> {
+        let len = wanted.len();
+        let res = sl.binary_search_by(|item| {
+            // We would like element_offset, but that is currently nightly only. Instead, do a
+            // pointer subtraction to find the index.
+            let index = unsafe { (item as *const u8).offset_from(sl.as_ptr()) } as usize;
+            // Now we have the index of this object. Round it down to the nearest full-sized
+            // chunk to find the actual offset where this starts.
+            let index = index - (index % len);
+            // Compute the comparison of that value instead, which will provide the expected
+            // result.
+            sl[index..index + wanted.len()].cmp(wanted)
+        });
+        res.ok().map(|offset| offset / len)
+    }
+
+    /// Look up `oid` in the map in order to convert it to `algo`.
+    ///
+    /// If this object is in the map, return the offset in the table for the main algorithm.
+    fn look_up_object(&self, oid: &ObjectID) -> Option<usize> {
+        let oid_algo = HashAlgorithm::from_u32(oid.algo)?;
+        let params = self.obj_formats.get(&oid_algo)?;
+        let short_table =
+            &self.memory[params.data_off..params.data_off + (params.shortened_len * self.nitems)];
+        let index = Self::binary_search_slice(
+            short_table,
+            &oid.as_slice().unwrap()[0..params.shortened_len],
+        )?;
+        match params.mapping_off {
+            Some(from_off) => {
+                // oid is in a compatibility algorithm. Find the mapping index.
+                let mapped = Self::u32_at_offset(self.memory, from_off + index * 4) as usize;
+                if mapped >= self.nitems {
+                    return None;
+                }
+                let oid_offset = params.full_off + mapped * oid_algo.raw_len();
+                if self.memory[oid_offset..oid_offset + oid_algo.raw_len()]
+                    != *oid.as_slice().unwrap()
+                {
+                    return None;
+                }
+                Some(mapped)
+            }
+            None => {
+                // oid is in the main algorithm. Find the object ID in the main map to confirm
+                // it's correct.
+                let oid_offset = params.full_off + index * oid_algo.raw_len();
+                if self.memory[oid_offset..oid_offset + oid_algo.raw_len()]
+                    != *oid.as_slice().unwrap()
+                {
+                    return None;
+                }
+                Some(index)
+            }
+        }
+    }
+
+    #[allow(dead_code)]
+    fn map_object(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<MappedObject> {
+        let main = self.look_up_object(oid)?;
+        let meta = MapType::from_u32(Self::u32_at_offset(self.memory, self.meta_off + (main * 4)))?;
+        Some(MappedObject {
+            oid: self.oid_from_offset(main, algo)?,
+            kind: meta,
+        })
+    }
+
+    fn map_oid(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<ObjectID> {
+        if algo as u32 == oid.algo {
+            return Some(oid.clone());
+        }
+
+        let main = self.look_up_object(oid)?;
+        self.oid_from_offset(main, algo)
+    }
+
+    fn oid_from_offset(&self, offset: usize, algo: HashAlgorithm) -> Option<ObjectID> {
+        let aparams = self.obj_formats.get(&algo)?;
+
+        let mut hash = [0u8; GIT_MAX_RAWSZ];
+        let len = algo.raw_len();
+        let oid_off = aparams.full_off + (offset * len);
+        hash[0..len].copy_from_slice(&self.memory[oid_off..oid_off + len]);
+        Some(ObjectID {
+            hash,
+            algo: algo as u32,
+        })
+    }
+
+    fn u32_at_offset(slice: &[u8], offset: usize) -> u32 {
+        u32::from_be_bytes(slice[offset..offset + 4].try_into().unwrap())
+    }
+
+    fn u64_at_offset(slice: &[u8], offset: usize) -> u64 {
+        u64::from_be_bytes(slice[offset..offset + 8].try_into().unwrap())
+    }
+
+    fn verify_aligned(offset: usize) -> Result<(), MmapedParseError> {
+        if (offset & 3) != 0 {
+            return Err(MmapedParseError::UnalignedData);
+        }
+        Ok(())
+    }
+
+    fn verify_valid(slice: &[u8], offset: u64) -> Result<(), MmapedParseError> {
+        if offset >= slice.len() as u64 {
+            return Err(MmapedParseError::OffsetTooLarge);
+        }
+        Ok(())
+    }
+}
+
+/// A map for loose and other non-packed object IDs that maps between a storage and compatibility
+/// mapping.
+///
+/// In addition to the in-memory option, there is an optional batched storage, which can be used to
+/// write objects to disk in an efficient way.
+pub struct ObjectMap {
+    mem: ObjectMemoryMap,
+    batch: Option<ObjectMemoryMap>,
+}
+
+impl ObjectMap {
+    /// Create a new `ObjectMap` with the given hash algorithms.
+    ///
+    /// This initializes the memory map to automatically map the empty tree, empty blob, and null
+    /// object ID.
+    pub fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> Self {
+        let mut map = ObjectMemoryMap::new(storage, compat);
+        for (main, compat) in &[
+            (storage.empty_tree(), compat.empty_tree()),
+            (storage.empty_blob(), compat.empty_blob()),
+            (storage.null_oid(), compat.null_oid()),
+        ] {
+            map.to_storage.insert(
+                (*compat).clone(),
+                MappedObject {
+                    oid: (*main).clone(),
+                    kind: MapType::Reserved,
+                },
+            );
+            map.to_compat.insert(
+                (*main).clone(),
+                MappedObject {
+                    oid: (*compat).clone(),
+                    kind: MapType::Reserved,
+                },
+            );
+        }
+        Self {
+            mem: map,
+            batch: None,
+        }
+    }
+
+    pub fn hash_algo(&self) -> HashAlgorithm {
+        self.mem.storage
+    }
+
+    /// Start a batch for efficient writing.
+    ///
+    /// If there is already a batch started, this does nothing and the existing batch is retained.
+    pub fn start_batch(&mut self) {
+        if self.batch.is_none() {
+            self.batch = Some(ObjectMemoryMap::new(self.mem.storage, self.mem.compat));
+        }
+    }
+
+    pub fn batch_len(&self) -> Option<usize> {
+        self.batch.as_ref().map(|b| b.len())
+    }
+
+    /// If a batch exists, write it to the writer.
+    pub fn finish_batch<W: Write>(&mut self, w: W) -> io::Result<()> {
+        if let Some(txn) = self.batch.take() {
+            txn.write(w)?;
+        }
+        Ok(())
+    }
+
+    /// If a batch exists, discard it without writing it anywhere.
+    pub fn abort_batch(&mut self) {
+        self.batch = None;
+    }
+
+    /// Return whether there is a batch already started.
+    ///
+    /// If you just want a batch to exist and don't care whether one has already been started, you
+    /// may simply call `start_batch` unconditionally.
+    pub fn has_batch(&self) -> bool {
+        self.batch.is_some()
+    }
+
+    /// Insert an object into the map.
+    ///
+    /// If `write` is true and there is a batch started, write the object into the batch as well as
+    /// into the memory map.
+    pub fn insert(&mut self, oid1: &ObjectID, oid2: &ObjectID, kind: MapType, write: bool) {
+        let (compat_oid, storage_oid) =
+            if HashAlgorithm::from_u32(oid1.algo) == Some(self.mem.compat) {
+                (oid1, oid2)
+            } else {
+                (oid2, oid1)
+            };
+        Self::insert_into(&mut self.mem, storage_oid, compat_oid, kind);
+        if write {
+            if let Some(ref mut batch) = self.batch {
+                Self::insert_into(batch, storage_oid, compat_oid, kind);
+            }
+        }
+    }
+
+    fn insert_into(
+        map: &mut ObjectMemoryMap,
+        storage: &ObjectID,
+        compat: &ObjectID,
+        kind: MapType,
+    ) {
+        map.to_compat.insert(
+            storage.clone(),
+            MappedObject {
+                oid: compat.clone(),
+                kind,
+            },
+        );
+        map.to_storage.insert(
+            compat.clone(),
+            MappedObject {
+                oid: storage.clone(),
+                kind,
+            },
+        );
+    }
+
+    #[allow(dead_code)]
+    fn map_object(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<&MappedObject> {
+        let map = if algo == self.mem.storage {
+            &self.mem.to_storage
+        } else {
+            &self.mem.to_compat
+        };
+        map.get(oid)
+    }
+
+    #[allow(dead_code)]
+    fn map_oid<'a, 'b: 'a>(
+        &'b self,
+        oid: &'a ObjectID,
+        algo: HashAlgorithm,
+    ) -> Option<&'a ObjectID> {
+        if algo as u32 == oid.algo {
+            return Some(oid);
+        }
+        let entry = self.map_object(oid, algo);
+        entry.map(|obj| &obj.oid)
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::{MapType, MmapedObjectMap, ObjectMap, ObjectMemoryMap};
+    use crate::hash::{CryptoDigest, CryptoHasher, HashAlgorithm, ObjectID};
+    use std::convert::TryInto;
+    use std::io::{self, Cursor, Write};
+
+    struct TrailingWriter {
+        curs: Cursor<Vec<u8>>,
+        hasher: CryptoHasher,
+    }
+
+    impl TrailingWriter {
+        fn new() -> Self {
+            Self {
+                curs: Cursor::new(Vec::new()),
+                hasher: CryptoHasher::new(HashAlgorithm::SHA256),
+            }
+        }
+
+        fn finalize(mut self) -> Vec<u8> {
+            let _ = self.hasher.flush();
+            let mut v = self.curs.into_inner();
+            v.extend(self.hasher.into_vec());
+            v
+        }
+    }
+
+    impl Write for TrailingWriter {
+        fn write(&mut self, data: &[u8]) -> io::Result<usize> {
+            self.hasher.write_all(data)?;
+            self.curs.write_all(data)?;
+            Ok(data.len())
+        }
+
+        fn flush(&mut self) -> io::Result<()> {
+            self.hasher.flush()?;
+            self.curs.flush()?;
+            Ok(())
+        }
+    }
+
+    fn sha1_oid(b: &[u8]) -> ObjectID {
+        assert_eq!(b.len(), 20);
+        let mut data = [0u8; 32];
+        data[0..20].copy_from_slice(b);
+        ObjectID {
+            hash: data,
+            algo: HashAlgorithm::SHA1 as u32,
+        }
+    }
+
+    fn sha256_oid(b: &[u8]) -> ObjectID {
+        assert_eq!(b.len(), 32);
+        ObjectID {
+            hash: b.try_into().unwrap(),
+            algo: HashAlgorithm::SHA256 as u32,
+        }
+    }
+
+    #[allow(clippy::type_complexity)]
+    fn test_entries() -> &'static [(&'static str, &'static [u8], &'static [u8], MapType, bool)] {
+        // These are all example blobs containing the content in the first argument.
+        &[
+            ("abc", b"\xf2\xba\x8f\x84\xab\x5c\x1b\xce\x84\xa7\xb4\x41\xcb\x19\x59\xcf\xc7\x09\x3b\x7f", b"\xc1\xcf\x6e\x46\x50\x77\x93\x0e\x88\xdc\x51\x36\x64\x1d\x40\x2f\x72\xa2\x29\xdd\xd9\x96\xf6\x27\xd6\x0e\x96\x39\xea\xba\x35\xa6", MapType::LooseObject, false),
+            ("def", b"\x0c\x00\x38\x32\xe7\xbf\xa9\xca\x8b\x5c\x20\x35\xc9\xbd\x68\x4a\x5f\x26\x23\xbc", b"\x8a\x90\x17\x26\x48\x4d\xb0\xf2\x27\x9f\x30\x8d\x58\x96\xd9\x6b\xf6\x3a\xd6\xde\x95\x7c\xa3\x8a\xdc\x33\x61\x68\x03\x6e\xf6\x63", MapType::Shallow, true),
+            ("ghi", b"\x45\xa8\x2e\x29\x5c\x52\x47\x31\x14\xc5\x7c\x18\xf4\xf5\x23\x68\xdf\x2a\x3c\xfd", b"\x6e\x47\x4c\x74\xf5\xd7\x78\x14\xc7\xf7\xf0\x7c\x37\x80\x07\x90\x53\x42\xaf\x42\x81\xe6\x86\x8d\x33\x46\x45\x4b\xb8\x63\xab\xc3", MapType::Submodule, false),
+            ("jkl", b"\x45\x32\x8c\x36\xff\x2e\x9b\x9b\x4e\x59\x2c\x84\x7d\x3f\x9a\x7f\xd9\xb3\xe7\x16", b"\xc3\xee\xf7\x54\xa2\x1e\xc6\x9d\x43\x75\xbe\x6f\x18\x47\x89\xa8\x11\x6f\xd9\x66\xfc\x67\xdc\x31\xd2\x11\x15\x42\xc8\xd5\xa0\xaf", MapType::LooseObject, true),
+        ]
+    }
+
+    fn test_map(write_all: bool) -> Box<ObjectMap> {
+        let mut map = Box::new(ObjectMap::new(HashAlgorithm::SHA256, HashAlgorithm::SHA1));
+
+        map.start_batch();
+
+        for (_blob_content, sha1, sha256, kind, swap) in test_entries() {
+            let s256 = sha256_oid(sha256);
+            let s1 = sha1_oid(sha1);
+            let write = write_all || (*kind as u32 & 2) == 0;
+            if *swap {
+                // Insert the item into the batch arbitrarily based on the type. This tests that
+                // we can specify either order and we'll do the right thing.
+                map.insert(&s256, &s1, *kind, write);
+            } else {
+                map.insert(&s1, &s256, *kind, write);
+            }
+        }
+
+        map
+    }
+
+    #[test]
+    fn can_read_and_write_format() {
+        for full in &[true, false] {
+            let mut map = test_map(*full);
+            let mut wrtr = TrailingWriter::new();
+            map.finish_batch(&mut wrtr).unwrap();
+
+            assert!(!map.has_batch());
+
+            let data = wrtr.finalize();
+            MmapedObjectMap::new(&data, HashAlgorithm::SHA256).unwrap();
+        }
+    }
+
+    #[test]
+    fn looks_up_from_mmaped() {
+        let mut map = test_map(true);
+        let mut wrtr = TrailingWriter::new();
+        map.finish_batch(&mut wrtr).unwrap();
+
+        assert!(!map.has_batch());
+
+        let data = wrtr.finalize();
+        let entries = test_entries();
+        let map = MmapedObjectMap::new(&data, HashAlgorithm::SHA256).unwrap();
+
+        for (_, sha1, sha256, kind, _) in entries {
+            let s256 = sha256_oid(sha256);
+            let s1 = sha1_oid(sha1);
+
+            let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(res.oid, s1);
+            assert_eq!(res.kind, *kind);
+            let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(res, s1);
+
+            let res = map.map_object(&s256, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(res.oid, s256);
+            assert_eq!(res.kind, *kind);
+            let res = map.map_oid(&s256, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(res, s256);
+
+            let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(res.oid, s256);
+            assert_eq!(res.kind, *kind);
+            let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(res, s256);
+
+            let res = map.map_object(&s1, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(res.oid, s1);
+            assert_eq!(res.kind, *kind);
+            let res = map.map_oid(&s1, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(res, s1);
+        }
+
+        for octet in &[0x00u8, 0x6d, 0x6e, 0x8a, 0xff] {
+            let missing_oid = ObjectID {
+                hash: [*octet; 32],
+                algo: HashAlgorithm::SHA256 as u32,
+            };
+
+            assert!(map.map_object(&missing_oid, HashAlgorithm::SHA1).is_none());
+            assert!(map.map_oid(&missing_oid, HashAlgorithm::SHA1).is_none());
+
+            assert_eq!(
+                map.map_oid(&missing_oid, HashAlgorithm::SHA256).unwrap(),
+                missing_oid
+            );
+        }
+    }
+
+    #[test]
+    fn binary_searches_slices_correctly() {
+        let sl = &[
+            0, 1, 2, 15, 14, 13, 18, 10, 2, 20, 20, 20, 21, 21, 0, 21, 21, 1, 21, 21, 21, 21, 21,
+            22, 22, 23, 24,
+        ];
+
+        let expected: &[(&[u8], Option<usize>)] = &[
+            (&[0, 1, 2], Some(0)),
+            (&[15, 14, 13], Some(1)),
+            (&[18, 10, 2], Some(2)),
+            (&[20, 20, 20], Some(3)),
+            (&[21, 21, 0], Some(4)),
+            (&[21, 21, 1], Some(5)),
+            (&[21, 21, 21], Some(6)),
+            (&[21, 21, 22], Some(7)),
+            (&[22, 23, 24], Some(8)),
+            (&[2, 15, 14], None),
+            (&[0, 21, 21], None),
+            (&[21, 21, 23], None),
+            (&[22, 22, 23], None),
+            (&[0xff, 0xff, 0xff], None),
+            (&[0, 0, 0], None),
+        ];
+
+        for (wanted, value) in expected {
+            assert_eq!(MmapedObjectMap::binary_search_slice(sl, wanted), *value);
+        }
+    }
+
+    #[test]
+    fn looks_up_oid_correctly() {
+        let map = test_map(false);
+        let entries = test_entries();
+
+        let s256 = sha256_oid(entries[0].2);
+        let s1 = sha1_oid(entries[0].1);
+
+        let missing_oid = ObjectID {
+            hash: [0xffu8; 32],
+            algo: HashAlgorithm::SHA256 as u32,
+        };
+
+        let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap();
+        assert_eq!(res.oid, s1);
+        assert_eq!(res.kind, MapType::LooseObject);
+        let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap();
+        assert_eq!(*res, s1);
+
+        let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap();
+        assert_eq!(res.oid, s256);
+        assert_eq!(res.kind, MapType::LooseObject);
+        let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap();
+        assert_eq!(*res, s256);
+
+        assert!(map.map_object(&missing_oid, HashAlgorithm::SHA1).is_none());
+        assert!(map.map_oid(&missing_oid, HashAlgorithm::SHA1).is_none());
+
+        assert_eq!(
+            *map.map_oid(&missing_oid, HashAlgorithm::SHA256).unwrap(),
+            missing_oid
+        );
+    }
+
+    #[test]
+    fn looks_up_known_oids_correctly() {
+        let map = test_map(false);
+
+        let funcs: &[&dyn Fn(HashAlgorithm) -> &'static ObjectID] = &[
+            &|h: HashAlgorithm| h.empty_tree(),
+            &|h: HashAlgorithm| h.empty_blob(),
+            &|h: HashAlgorithm| h.null_oid(),
+        ];
+
+        for f in funcs {
+            let s256 = f(HashAlgorithm::SHA256);
+            let s1 = f(HashAlgorithm::SHA1);
+
+            let res = map.map_object(s256, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(res.oid, *s1);
+            assert_eq!(res.kind, MapType::Reserved);
+            let res = map.map_oid(s256, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(*res, *s1);
+
+            let res = map.map_object(s1, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(res.oid, *s256);
+            assert_eq!(res.kind, MapType::Reserved);
+            let res = map.map_oid(s1, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(*res, *s256);
+        }
+    }
+
+    #[test]
+    fn nul_padding() {
+        assert_eq!(ObjectMemoryMap::required_nul_padding(1, 1), 3);
+        assert_eq!(ObjectMemoryMap::required_nul_padding(2, 1), 2);
+        assert_eq!(ObjectMemoryMap::required_nul_padding(3, 1), 1);
+        assert_eq!(ObjectMemoryMap::required_nul_padding(2, 2), 0);
+
+        assert_eq!(ObjectMemoryMap::required_nul_padding(39, 3), 3);
+    }
+}
diff --git a/src/meson.build b/src/meson.build
index c77041a3fa..1eea068519 100644
--- a/src/meson.build
+++ b/src/meson.build
@@ -1,6 +1,7 @@
 libgit_rs_sources = [
   'hash.rs',
   'lib.rs',
+  'loose.rs',
   'varint.rs',
 ]

^ permalink raw reply related	[flat|nested] 101+ messages in thread
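
A minimal sketch (not part of the patch) of the write/parse round trip
that the tests above exercise. It assumes it is added inside
src/loose.rs, where the private MmapedObjectMap helpers are visible, and
the two object IDs here are arbitrary placeholder bytes, not real
hashes:

    #[test]
    fn roundtrip_sketch() {
        // Two arbitrary IDs to pair up; only their algo fields matter here.
        let mut h1 = [0u8; 32];
        h1[0..20].copy_from_slice(&[0x11; 20]);
        let s1 = ObjectID {
            hash: h1,
            algo: HashAlgorithm::SHA1 as u32,
        };
        let s256 = ObjectID {
            hash: [0x22; 32],
            algo: HashAlgorithm::SHA256 as u32,
        };

        let mut map = ObjectMap::new(HashAlgorithm::SHA256, HashAlgorithm::SHA1);
        map.start_batch();
        // insert() works out which ID is storage and which is compat from
        // the algo fields, so either argument order is fine.
        map.insert(&s1, &s256, MapType::LooseObject, true);

        // Serialize the batch, then append raw_len trailing bytes: the
        // parser validates only the trailer's length, not its contents, so
        // a zeroed placeholder is enough for this sketch. Real files carry
        // a checksum there, as TrailingWriter demonstrates above.
        let mut data = Vec::new();
        map.finish_batch(&mut data).unwrap();
        data.extend([0u8; 32]);

        let parsed = MmapedObjectMap::new(&data, HashAlgorithm::SHA256).unwrap();
        assert_eq!(parsed.map_oid(&s1, HashAlgorithm::SHA256).unwrap(), s256);
        assert_eq!(parsed.map_oid(&s256, HashAlgorithm::SHA1).unwrap(), s1);
    }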
* [PATCH v2 14/15] rust: add a small wrapper around the hashfile code
  2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson
                     ` (12 preceding siblings ...)
  2025-11-17 22:16 ` [PATCH v2 13/15] rust: add a new binary object map format brian m. carlson
@ 2025-11-17 22:16 ` brian m. carlson
  2025-11-17 22:16 ` [PATCH v2 15/15] object-file-convert: always make sure object ID algo is valid brian m. carlson
  14 siblings, 0 replies; 101+ messages in thread
From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

Our new binary object map code avoids needing to be intimately involved
with file handling by simply writing data to any object implementing
Write.  This makes it very easy to test, since tests can write to a
Cursor wrapping a Vec, and it decouples the format from knowledge about
how we handle files.

However, we will ultimately want to write our data to an actual file,
since that's the most practical way to persist it.  Implement a wrapper
around the hashfile code that implements the Write trait so that we can
write our object map into a file.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile         |  1 +
 src/csum_file.rs | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
 src/lib.rs       |  1 +
 src/meson.build  |  1 +
 4 files changed, 84 insertions(+)
 create mode 100644 src/csum_file.rs

diff --git a/Makefile b/Makefile
index 40785c14fd..b05709c5e9 100644
--- a/Makefile
+++ b/Makefile
@@ -1534,6 +1534,7 @@ CLAR_TEST_OBJS += $(UNIT_TEST_DIR)/unit-test.o
 
 UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o
 
+RUST_SOURCES += src/csum_file.rs
 RUST_SOURCES += src/hash.rs
 RUST_SOURCES += src/lib.rs
 RUST_SOURCES += src/loose.rs
diff --git a/src/csum_file.rs b/src/csum_file.rs
new file mode 100644
index 0000000000..7f2c6c4fcb
--- /dev/null
+++ b/src/csum_file.rs
@@ -0,0 +1,81 @@
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation: version 2 of the License, dated June 1991.
+//
+// This program is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License along
+// with this program; if not, see <https://www.gnu.org/licenses/>.
+
+use crate::hash::{HashAlgorithm, GIT_MAX_RAWSZ};
+use std::ffi::CStr;
+use std::io::{self, Write};
+use std::os::raw::c_void;
+
+/// A writer that can write files identified by their hash or containing a trailing hash.
+pub struct HashFile {
+    ptr: *mut c_void,
+    algo: HashAlgorithm,
+}
+
+impl HashFile {
+    /// Create a new HashFile.
+    ///
+    /// The hash used will be `algo`, its name should be in `name`, and an open file descriptor
+    /// pointing to that file should be in `fd`.
+    pub fn new(algo: HashAlgorithm, fd: i32, name: &CStr) -> HashFile {
+        HashFile {
+            ptr: unsafe { c::hashfd(algo.hash_algo_ptr(), fd, name.as_ptr()) },
+            algo,
+        }
+    }
+
+    /// Finalize this HashFile instance.
+    ///
+    /// Returns the hash computed over the data.
+    pub fn finalize(self, component: u32, flags: u32) -> Vec<u8> {
+        let mut result = vec![0u8; GIT_MAX_RAWSZ];
+        unsafe { c::finalize_hashfile(self.ptr, result.as_mut_ptr(), component, flags) };
+        result.truncate(self.algo.raw_len());
+        result
+    }
+}
+
+impl Write for HashFile {
+    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
+        for chunk in data.chunks(u32::MAX as usize) {
+            unsafe {
+                c::hashwrite(
+                    self.ptr,
+                    chunk.as_ptr() as *const c_void,
+                    chunk.len() as u32,
+                )
+            };
+        }
+        Ok(data.len())
+    }
+
+    fn flush(&mut self) -> io::Result<()> {
+        unsafe { c::hashflush(self.ptr) };
+        Ok(())
+    }
+}
+
+pub mod c {
+    use std::os::raw::{c_char, c_int, c_void};
+
+    extern "C" {
+        pub fn hashfd(algop: *const c_void, fd: i32, name: *const c_char) -> *mut c_void;
+        pub fn hashwrite(f: *mut c_void, data: *const c_void, len: u32);
+        pub fn hashflush(f: *mut c_void);
+        pub fn finalize_hashfile(
+            f: *mut c_void,
+            data: *mut u8,
+            component: u32,
+            flags: u32,
+        ) -> c_int;
+    }
+}
diff --git a/src/lib.rs b/src/lib.rs
index 442f9433dc..0c598298b1 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -1,3 +1,4 @@
+pub mod csum_file;
 pub mod hash;
 pub mod loose;
 pub mod varint;
diff --git a/src/meson.build b/src/meson.build
index 1eea068519..45739957b4 100644
--- a/src/meson.build
+++ b/src/meson.build
@@ -1,4 +1,5 @@
 libgit_rs_sources = [
+  'csum_file.rs',
   'hash.rs',
   'lib.rs',
   'loose.rs',

^ permalink raw reply related	[flat|nested] 101+ messages in thread
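
A minimal usage sketch (not part of the patch) for this wrapper. It
assumes a Unix platform (for into_raw_fd) and crate-internal paths; the
file name is a placeholder, and the 0/0 component/flags arguments merely
stand in for the fsync component and CSUM_* flags that real callers
would pass through to the C finalize_hashfile():

    use crate::csum_file::HashFile;
    use crate::hash::HashAlgorithm;
    use std::ffi::CString;
    use std::fs::File;
    use std::io::{self, Write};
    use std::os::unix::io::IntoRawFd;

    fn write_with_trailing_hash(data: &[u8]) -> io::Result<Vec<u8>> {
        let path = "objects/example.tmp"; // placeholder path
        let name = CString::new(path).unwrap();
        // Hand ownership of the descriptor to the C hashfile machinery.
        let fd = File::create(path)?.into_raw_fd();
        let mut hf = HashFile::new(HashAlgorithm::SHA256, fd, &name);
        hf.write_all(data)?;
        // finalize() consumes the HashFile and returns the hash computed
        // over everything written; 0/0 are placeholder arguments here.
        Ok(hf.finalize(0, 0))
    }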
* [PATCH v2 15/15] object-file-convert: always make sure object ID algo is valid
  2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson
                     ` (13 preceding siblings ...)
  2025-11-17 22:16 ` [PATCH v2 14/15] rust: add a small wrapper around the hashfile code brian m. carlson
@ 2025-11-17 22:16 ` brian m. carlson
  14 siblings, 0 replies; 101+ messages in thread
From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

In some cases, we zero-initialize our object IDs, which sets the algo
member to zero as well, which is not a valid algorithm number.  This is
a bad practice, but we typically paper over it in many cases by simply
substituting the repository's hash algorithm.

However, our new Rust loose object map code doesn't handle this
gracefully and can't find object IDs when the algorithm is zero because
they don't compare equal to those with the correct algo field.  In
addition, the comparison code doesn't have any knowledge of what the
main algorithm is because that's global state, so we can't adjust the
comparison.

To make our code function properly and to avoid propagating these bad
entries, if we get a source object ID with a zero algo, just make a copy
of it with the fixed algorithm.  This has the benefit of also fixing the
object IDs if we're in a single algorithm mode as well.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 object-file-convert.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/object-file-convert.c b/object-file-convert.c
index e44c821084..f8dce94811 100644
--- a/object-file-convert.c
+++ b/object-file-convert.c
@@ -13,7 +13,7 @@
 #include "gpg-interface.h"
 #include "object-file-convert.h"
 
-int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
+int repo_oid_to_algop(struct repository *repo, const struct object_id *srcoid,
 		      const struct git_hash_algo *to, struct object_id *dest)
 {
 	/*
@@ -21,7 +21,15 @@ int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
 	 * default hash algorithm for that object.
 	 */
 	const struct git_hash_algo *from =
-		src->algo ? &hash_algos[src->algo] : repo->hash_algo;
+		srcoid->algo ? &hash_algos[srcoid->algo] : repo->hash_algo;
+	struct object_id temp;
+	const struct object_id *src = srcoid;
+
+	if (!srcoid->algo) {
+		oidcpy(&temp, srcoid);
+		temp.algo = hash_algo_by_ptr(repo->hash_algo);
+		src = &temp;
+	}
 
 	if (from == to || !to) {
 		if (src != dest)

^ permalink raw reply related	[flat|nested] 101+ messages in thread
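
A sketch (not part of the patch) of the failure mode being fixed, seen
from the Rust side. It assumes ObjectID's derived equality and ordering
include the algo field, which is what makes a zero-algo copy of the same
hash bytes a different key in the object map's BTreeMaps; the bytes
themselves are arbitrary placeholders:

    use crate::hash::{HashAlgorithm, ObjectID};

    #[test]
    fn zero_algo_is_a_different_key() {
        let good = ObjectID {
            hash: [0x11; 32],
            algo: HashAlgorithm::SHA1 as u32,
        };
        // A zero-initialized struct object_id arriving from C has algo == 0,
        // so it no longer compares equal to the key that was stored.
        let zeroed = ObjectID {
            algo: 0,
            ..good.clone()
        };
        assert_ne!(good, zeroed);
    }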