* [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
From: brian m. carlson @ 2025-10-27  0:43 UTC
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

This is the second part of the SHA-1/SHA-256 interoperability work. It
introduces our first major use of Rust code, implementing a loose object
format, as well as the preparatory work to make that happen, including
changing types to more Rust-friendly ones. Since Rust will be required
for the interoperability work, the testsuite now requires it as well.

We also verify that an object ID's algorithm is valid when looking up
data in the hash map. The Rust code intentionally has no knowledge of
global mutable state like the_repository, so it cannot fall back to the
main hash algorithm when we've zero-initialized a struct object_id.
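
A minimal sketch of that check (the function name and error handling
are illustrative, not this series' actual API; the only assumption is
that GIT_HASH_UNKNOWN is 0, which is what zero-initialization yields):

    /// Reject GIT_HASH_UNKNOWN (0), the value a zero-initialized
    /// struct object_id carries, instead of assuming any default.
    fn validated_algo(algo: u32) -> Result<u32, &'static str> {
        match algo {
            0 => Err("object ID carries no valid hash algorithm"),
            algo => Ok(algo),
        }
    }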

The advantage of this Rust code is that it is comprehensively covered
by unit tests. We can serialize our loose object map, verify that we
can load it again, and perform various checks, such as whether certain
object IDs are found in the map and mapped correctly. We can also
effectively test our slightly subtle custom binary search code and be
confident that it works; Rust's standard library doesn't provide a way
to binary search records of variable width within a byte slice.
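
A minimal sketch of such a search, assuming sorted records of a width
known only at run time packed into a byte buffer (illustrative only,
not the implementation in src/loose.rs):

    use std::cmp::Ordering;

    /// Binary search over `buf`, which holds sorted records of `width`
    /// bytes each. slice::binary_search only works on &[T] with a
    /// compile-time element type, so runtime-sized records need a
    /// hand-rolled search like this.
    fn find_record(buf: &[u8], width: usize, needle: &[u8]) -> Option<usize> {
        assert!(width > 0 && buf.len() % width == 0 && needle.len() == width);
        let (mut lo, mut hi) = (0, buf.len() / width);
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let rec = &buf[mid * width..(mid + 1) * width];
            match rec.cmp(needle) {
                Ordering::Less => lo = mid + 1,
                Ordering::Greater => hi = mid,
                Ordering::Equal => return Some(mid),
            }
        }
        None
    }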

The new Rust files adopt an approach that is slightly different from
some of our other files and place a license notice at the top. This is
required because of DCO part (a): "I have the right to submit it under
the open source license indicated in the file". It also avoids
ambiguity if the file is copied into a separate location (such as an
LLM training corpus).

brian m. carlson (14):
repository: require Rust support for interoperability
conversion: don't crash when no destination algo
hash: use uint32_t for object_id algorithm
rust: add a ObjectID struct
rust: add a hash algorithm abstraction
hash: add a function to look up hash algo structs
csum-file: define hashwrite's count as a uint32_t
write-or-die: add an fsync component for the loose object map
hash: expose hash context functions to Rust
rust: add a build.rs script for tests
rust: add functionality to hash an object
rust: add a new binary loose object map format
rust: add a small wrapper around the hashfile code
object-file-convert: always make sure object ID algo is valid
Documentation/gitformat-loose.adoc | 104 ++++
Makefile | 5 +-
build.rs | 21 +
csum-file.c | 2 +-
csum-file.h | 2 +-
hash.c | 46 +-
hash.h | 38 +-
object-file-convert.c | 14 +-
oidtree.c | 2 +-
repository.c | 13 +-
repository.h | 4 +-
serve.c | 2 +-
src/csum_file.rs | 81 +++
src/hash.rs | 335 +++++++++++
src/lib.rs | 3 +
src/loose.rs | 912 +++++++++++++++++++++++++++++
src/meson.build | 3 +
t/t1006-cat-file.sh | 82 ++-
t/t1016-compatObjectFormat.sh | 6 +
t/t1500-rev-parse.sh | 2 +-
t/t9305-fast-import-signatures.sh | 4 +-
t/t9350-fast-export.sh | 4 +-
t/test-lib.sh | 4 +
write-or-die.h | 4 +-
24 files changed, 1619 insertions(+), 74 deletions(-)
create mode 100644 build.rs
create mode 100644 src/csum_file.rs
create mode 100644 src/hash.rs
create mode 100644 src/loose.rs

* [PATCH 01/14] repository: require Rust support for interoperability

From: brian m. carlson @ 2025-10-27  0:43 UTC
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

We'll be implementing some of our interoperability code, like the loose
object map, in Rust. While the code currently compiles with the old
loose object map format, which is written entirely in C, we'll soon
replace that with the Rust-based implementation.

Require the use of Rust for compatibility mode and die if it is not
supported. Because the repo argument is not used when Rust is missing,
cast it to void to silence the compiler warning, which we do not care
about.

Add a prerequisite in our tests, RUST, that checks if Rust functionality
is available and use it in the tests that handle interoperability.

This is technically a regression in functionality compared to our
existing state, but pack index v3 is not yet implemented and thus the
functionality is mostly quite broken, which is why we've recently marked
this functionality as experimental. We don't believe anyone is getting
real use out of the interoperability code in its current state, so no
actual users should be negatively impacted by this change.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 repository.c                      |  7 +++
 t/t1006-cat-file.sh               | 82 +++++++++++++++++++++----------
 t/t1016-compatObjectFormat.sh     |  6 +++
 t/t1500-rev-parse.sh              |  2 +-
 t/t9305-fast-import-signatures.sh |  4 +-
 t/t9350-fast-export.sh            |  4 +-
 t/test-lib.sh                     |  4 ++
 7 files changed, 77 insertions(+), 32 deletions(-)

diff --git a/repository.c b/repository.c
index 6faf5c7398..823f110019 100644
--- a/repository.c
+++ b/repository.c
@@ -3,6 +3,7 @@
 #include "repository.h"
 #include "odb.h"
 #include "config.h"
+#include "gettext.h"
 #include "object.h"
 #include "lockfile.h"
 #include "path.h"
@@ -192,11 +193,17 @@ void repo_set_hash_algo(struct repository *repo, int hash_algo)
 
 void repo_set_compat_hash_algo(struct repository *repo, int algo)
 {
+#ifdef WITH_RUST
 	if (hash_algo_by_ptr(repo->hash_algo) == algo)
 		BUG("hash_algo and compat_hash_algo match");
 	repo->compat_hash_algo = algo ? &hash_algos[algo] : NULL;
 	if (repo->compat_hash_algo)
 		repo_read_loose_object_map(repo);
+#else
+	(void)repo;
+	if (algo)
+		die(_("compatibility hash algorithm support requires Rust"));
+#endif
 }
 
 void repo_set_ref_storage_format(struct repository *repo,
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 1f61b666a7..29a9503523 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -241,10 +241,16 @@ hello_content="Hello World"
 hello_size=$(strlen "$hello_content")
 hello_oid=$(echo_without_newline "$hello_content" | git hash-object --stdin)
 
-test_expect_success "setup" '
+test_expect_success "setup part 1" '
 	git config core.repositoryformatversion 1 &&
-	git config extensions.objectformat $test_hash_algo &&
-	git config extensions.compatobjectformat $test_compat_hash_algo &&
+	git config extensions.objectformat $test_hash_algo
+'
+
+test_expect_success RUST 'compat setup' '
+	git config extensions.compatobjectformat $test_compat_hash_algo
+'
+
+test_expect_success 'setup part 2' '
 	echo_without_newline "$hello_content" > hello &&
 	git update-index --add hello &&
 	echo_without_newline "$hello_content" > "path with spaces" &&
@@ -273,9 +279,13 @@ run_blob_tests () {
 	'
 }
 
-hello_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $hello_oid)
 run_blob_tests $hello_oid
-run_blob_tests $hello_compat_oid
+
+if test_have_prereq RUST
+then
+	hello_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $hello_oid)
+	run_blob_tests $hello_compat_oid
+fi
 
 test_expect_success '--batch-check without %(rest) considers whole line' '
 	echo "$hello_oid blob $hello_size" >expect &&
@@ -286,62 +296,76 @@ test_expect_success '--batch-check without %(rest) considers whole line' '
 '
 
 tree_oid=$(git write-tree)
-tree_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tree_oid)
 tree_size=$((2 * $(test_oid rawsz) + 13 + 24))
-tree_compat_size=$((2 * $(test_oid --hash=compat rawsz) + 13 + 24))
 tree_pretty_content="100644 blob $hello_oid	hello${LF}100755 blob $hello_oid	path with spaces${LF}"
-tree_compat_pretty_content="100644 blob $hello_compat_oid	hello${LF}100755 blob $hello_compat_oid	path with spaces${LF}"
 
 run_tests 'tree' $tree_oid "" $tree_size "" "$tree_pretty_content"
-run_tests 'tree' $tree_compat_oid "" $tree_compat_size "" "$tree_compat_pretty_content"
 run_tests 'blob' "$tree_oid:hello" "100644" $hello_size "" "$hello_content" $hello_oid
-run_tests 'blob' "$tree_compat_oid:hello" "100644" $hello_size "" "$hello_content" $hello_compat_oid
 run_tests 'blob' "$tree_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_oid
-run_tests 'blob' "$tree_compat_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_compat_oid
+
+if test_have_prereq RUST
+then
+	tree_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tree_oid)
+	tree_compat_size=$((2 * $(test_oid --hash=compat rawsz) + 13 + 24))
+	tree_compat_pretty_content="100644 blob $hello_compat_oid	hello${LF}100755 blob $hello_compat_oid	path with spaces${LF}"
+
+	run_tests 'tree' $tree_compat_oid "" $tree_compat_size "" "$tree_compat_pretty_content"
+	run_tests 'blob' "$tree_compat_oid:hello" "100644" $hello_size "" "$hello_content" $hello_compat_oid
+	run_tests 'blob' "$tree_compat_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_compat_oid
+fi
 
 commit_message="Initial commit"
 commit_oid=$(echo_without_newline "$commit_message" | git commit-tree $tree_oid)
-commit_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $commit_oid)
 commit_size=$(($(test_oid hexsz) + 137))
-commit_compat_size=$(($(test_oid --hash=compat hexsz) + 137))
 commit_content="tree $tree_oid
 author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> $GIT_AUTHOR_DATE
 committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
 
 $commit_message"
 
-commit_compat_content="tree $tree_compat_oid
+run_tests 'commit' $commit_oid "" $commit_size "$commit_content" "$commit_content"
+
+if test_have_prereq RUST
+then
+	commit_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $commit_oid)
+	commit_compat_size=$(($(test_oid --hash=compat hexsz) + 137))
+	commit_compat_content="tree $tree_compat_oid
 author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> $GIT_AUTHOR_DATE
 committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
 
 $commit_message"
 
-run_tests 'commit' $commit_oid "" $commit_size "$commit_content" "$commit_content"
-run_tests 'commit' $commit_compat_oid "" $commit_compat_size "$commit_compat_content" "$commit_compat_content"
+	run_tests 'commit' $commit_compat_oid "" $commit_compat_size "$commit_compat_content" "$commit_compat_content"
+fi
 
 tag_header_without_oid="type blob
 tag hellotag
 tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL>"
 tag_header_without_timestamp="object $hello_oid
 $tag_header_without_oid"
-tag_compat_header_without_timestamp="object $hello_compat_oid
-$tag_header_without_oid"
 tag_description="This is a tag"
 tag_content="$tag_header_without_timestamp 0 +0000
 
-$tag_description"
-tag_compat_content="$tag_compat_header_without_timestamp 0 +0000
-
 $tag_description"
 
 tag_oid=$(echo_without_newline "$tag_content" | git hash-object -t tag --stdin -w)
 tag_size=$(strlen "$tag_content")
-tag_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tag_oid)
-tag_compat_size=$(strlen "$tag_compat_content")
-
 run_tests 'tag' $tag_oid "" $tag_size "$tag_content" "$tag_content"
-run_tests 'tag' $tag_compat_oid "" $tag_compat_size "$tag_compat_content" "$tag_compat_content"
+
+if test_have_prereq RUST
+then
+	tag_compat_header_without_timestamp="object $hello_compat_oid
+$tag_header_without_oid"
+	tag_compat_content="$tag_compat_header_without_timestamp 0 +0000
+
+$tag_description"
+
+	tag_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tag_oid)
+	tag_compat_size=$(strlen "$tag_compat_content")
+
+	run_tests 'tag' $tag_compat_oid "" $tag_compat_size "$tag_compat_content" "$tag_compat_content"
+fi
 
 test_expect_success "Reach a blob from a tag pointing to it" '
 	echo_without_newline "$hello_content" >expect &&
@@ -590,7 +614,8 @@ flush"
 }
 
 batch_tests $hello_oid $tree_oid $tree_size $commit_oid $commit_size "$commit_content" $tag_oid $tag_size "$tag_content"
-batch_tests $hello_compat_oid $tree_compat_oid $tree_compat_size $commit_compat_oid $commit_compat_size "$commit_compat_content" $tag_compat_oid $tag_compat_size "$tag_compat_content"
+
+test_have_prereq RUST && batch_tests $hello_compat_oid $tree_compat_oid $tree_compat_size $commit_compat_oid $commit_compat_size "$commit_compat_content" $tag_compat_oid $tag_compat_size "$tag_compat_content"
 
 test_expect_success FUNNYNAMES 'setup with newline in input' '
@@ -1226,7 +1251,10 @@ test_expect_success 'batch-check with a submodule' '
 	test_unconfig extensions.compatobjectformat &&
 	printf "160000 commit $(test_oid deadbeef)\tsub\n" >tree-with-sub &&
 	tree=$(git mktree <tree-with-sub) &&
-	test_config extensions.compatobjectformat $test_compat_hash_algo &&
+	if test_have_prereq RUST
+	then
+		test_config extensions.compatobjectformat $test_compat_hash_algo
+	fi &&
 	git cat-file --batch-check >actual <<-EOF &&
 	$tree:sub
diff --git a/t/t1016-compatObjectFormat.sh b/t/t1016-compatObjectFormat.sh
index a9af8b2396..af3ceac3f5 100755
--- a/t/t1016-compatObjectFormat.sh
+++ b/t/t1016-compatObjectFormat.sh
@@ -8,6 +8,12 @@ test_description='Test how well compatObjectFormat works'
 . ./test-lib.sh
 . "$TEST_DIRECTORY"/lib-gpg.sh
 
+if ! test_have_prereq RUST
+then
+	skip_all='interoperability requires a Git built with Rust'
+	test_done
+fi
+
 # All of the follow variables must be defined in the environment:
 # GIT_AUTHOR_NAME
 # GIT_AUTHOR_EMAIL
diff --git a/t/t1500-rev-parse.sh b/t/t1500-rev-parse.sh
index 7739ab611b..98c5a772bd 100755
--- a/t/t1500-rev-parse.sh
+++ b/t/t1500-rev-parse.sh
@@ -208,7 +208,7 @@ test_expect_success 'rev-parse --show-object-format in repo' '
 '
 
-test_expect_success 'rev-parse --show-object-format in repo with compat mode' '
+test_expect_success RUST 'rev-parse --show-object-format in repo with compat mode' '
 	mkdir repo &&
 	(
 		sane_unset GIT_DEFAULT_HASH &&
diff --git a/t/t9305-fast-import-signatures.sh b/t/t9305-fast-import-signatures.sh
index c2b4271658..63c0a2b5c4 100755
--- a/t/t9305-fast-import-signatures.sh
+++ b/t/t9305-fast-import-signatures.sh
@@ -70,7 +70,7 @@ test_expect_success GPGSSH 'strip SSH signature with --signed-commits=strip' '
 	test_must_be_empty log
 '
 
-test_expect_success GPG 'setup a commit with dual OpenPGP signatures on its SHA-1 and SHA-256 formats' '
+test_expect_success RUST,GPG 'setup a commit with dual OpenPGP signatures on its SHA-1 and SHA-256 formats' '
 	# Create a signed SHA-256 commit
 	git init --object-format=sha256 explicit-sha256 &&
 	git -C explicit-sha256 config extensions.compatObjectFormat sha1 &&
@@ -91,7 +91,7 @@ test_expect_success GPG 'setup a commit with dual OpenPGP signatures on its SHA-
 	test_grep -E "^gpgsig-sha256 " out
 '
 
-test_expect_success GPG 'strip both OpenPGP signatures with --signed-commits=warn-strip' '
+test_expect_success RUST,GPG 'strip both OpenPGP signatures with --signed-commits=warn-strip' '
 	git -C explicit-sha256 fast-export --signed-commits=verbatim dual-signed >output &&
 	test_grep -E "^gpgsig sha1 openpgp" output &&
 	test_grep -E "^gpgsig sha256 openpgp" output &&
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 8f85c69d62..bf55e1e2e6 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -932,7 +932,7 @@ test_expect_success 'fast-export handles --end-of-options' '
 	test_cmp expect actual
 '
 
-test_expect_success GPG 'setup a commit with dual signatures on its SHA-1 and SHA-256 formats' '
+test_expect_success GPG,RUST 'setup a commit with dual signatures on its SHA-1 and SHA-256 formats' '
 	# Create a signed SHA-256 commit
 	git init --object-format=sha256 explicit-sha256 &&
 	git -C explicit-sha256 config extensions.compatObjectFormat sha1 &&
@@ -953,7 +953,7 @@ test_expect_success GPG 'setup a commit with dual signatures on its SHA-1 and SH
 	test_grep -E "^gpgsig-sha256 " out
 '
 
-test_expect_success GPG 'export and import of doubly signed commit' '
+test_expect_success GPG,RUST 'export and import of doubly signed commit' '
 	git -C explicit-sha256 fast-export --signed-commits=verbatim dual-signed >output &&
 	test_grep -E "^gpgsig sha1 openpgp" output &&
 	test_grep -E "^gpgsig sha256 openpgp" output &&
diff --git a/t/test-lib.sh b/t/test-lib.sh
index ef0ab7ec2d..3499a83806 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1890,6 +1890,10 @@ test_lazy_prereq LONG_IS_64BIT '
 	test 8 -le "$(build_option sizeof-long)"
 '
 
+test_lazy_prereq RUST '
+	test "$(build_option rust)" = enabled
+'
+
 test_lazy_prereq TIME_IS_64BIT 'test-tool date is64bit'
 test_lazy_prereq TIME_T_IS_64BIT 'test-tool date time_t-is64bit'

* Re: [PATCH 01/14] repository: require Rust support for interoperability

From: Patrick Steinhardt @ 2025-10-28  9:16 UTC
To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren

On Mon, Oct 27, 2025 at 12:43:51AM +0000, brian m. carlson wrote:
> We'll be implementing some of our interoperability code, like the loose
> object map, in Rust. While the code currently compiles with the old
> loose object map format, which is written entirely in C, we'll soon
> replace that with the Rust-based implementation.
>
> Require the use of Rust for compatibility mode and die if it is not
> supported. Because the repo argument is not used when Rust is missing,
> cast it to void to silence the compiler warning, which we do not care
> about.
>
> Add a prerequisite in our tests, RUST, that checks if Rust functionality
> is available and use it in the tests that handle interoperability.
>
> This is technically a regression in functionality compared to our
> existing state, but pack index v3 is not yet implemented and thus the
> functionality is mostly quite broken, which is why we've recently marked
> this functionality as experimental. We don't believe anyone is getting
> real use out of the interoperability code in its current state, so no
> actual users should be negatively impacted by this change.

Yeah, I don't see much of an issue with this.

> diff --git a/repository.c b/repository.c
> index 6faf5c7398..823f110019 100644
> --- a/repository.c
> +++ b/repository.c
> @@ -192,11 +193,17 @@ void repo_set_hash_algo(struct repository *repo, int hash_algo)
>
>  void repo_set_compat_hash_algo(struct repository *repo, int algo)
>  {
> +#ifdef WITH_RUST
>  	if (hash_algo_by_ptr(repo->hash_algo) == algo)
>  		BUG("hash_algo and compat_hash_algo match");
>  	repo->compat_hash_algo = algo ? &hash_algos[algo] : NULL;
>  	if (repo->compat_hash_algo)
>  		repo_read_loose_object_map(repo);
> +#else
> +	(void)repo;

You can annotate `repo` with `MAYBE_UNUSED` instead of casting.

Patrick

* [PATCH 02/14] conversion: don't crash when no destination algo

From: brian m. carlson @ 2025-10-27  0:43 UTC
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

When we set up a repository that doesn't have a compatibility hash
algorithm, we set the destination algorithm object to NULL. In such a
case, we want to silently do nothing instead of crashing, so simply
treat the operation as a no-op and copy the object ID.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 object-file-convert.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/object-file-convert.c b/object-file-convert.c
index 7ab875afe6..e44c821084 100644
--- a/object-file-convert.c
+++ b/object-file-convert.c
@@ -23,7 +23,7 @@ int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
 	const struct git_hash_algo *from =
 		src->algo ? &hash_algos[src->algo] : repo->hash_algo;
 
-	if (from == to) {
+	if (from == to || !to) {
 		if (src != dest)
 			oidcpy(dest, src);
 		return 0;

* [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: brian m. carlson @ 2025-10-27  0:43 UTC
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

We currently use an int for this value, but we'll define this structure
from Rust in a future commit and we want to ensure that our data types
are exactly identical. To make that possible, use a uint32_t for the
hash algorithm.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 hash.c       |  6 +++---
 hash.h       | 10 +++++-----
 oidtree.c    |  2 +-
 repository.c |  6 +++---
 repository.h |  4 ++--
 serve.c      |  2 +-
 6 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/hash.c b/hash.c
index 4a04ecb50e..81b4f87027 100644
--- a/hash.c
+++ b/hash.c
@@ -241,7 +241,7 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop)
 	return oid_to_hex_r(buf, algop->empty_tree);
 }
 
-int hash_algo_by_name(const char *name)
+uint32_t hash_algo_by_name(const char *name)
 {
 	if (!name)
 		return GIT_HASH_UNKNOWN;
@@ -251,7 +251,7 @@ int hash_algo_by_name(const char *name)
 	return GIT_HASH_UNKNOWN;
 }
 
-int hash_algo_by_id(uint32_t format_id)
+uint32_t hash_algo_by_id(uint32_t format_id)
 {
 	for (size_t i = 1; i < GIT_HASH_NALGOS; i++)
 		if (format_id == hash_algos[i].format_id)
@@ -259,7 +259,7 @@ int hash_algo_by_id(uint32_t format_id)
 	return GIT_HASH_UNKNOWN;
 }
 
-int hash_algo_by_length(size_t len)
+uint32_t hash_algo_by_length(size_t len)
 {
 	for (size_t i = 1; i < GIT_HASH_NALGOS; i++)
 		if (len == hash_algos[i].rawsz)
diff --git a/hash.h b/hash.h
index fae966b23c..99c9c2a0a8 100644
--- a/hash.h
+++ b/hash.h
@@ -211,7 +211,7 @@ static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *s
 
 struct object_id {
 	unsigned char hash[GIT_MAX_RAWSZ];
-	int algo;	/* XXX requires 4-byte alignment */
+	uint32_t algo;	/* XXX requires 4-byte alignment */
 };
 
 #define GET_OID_QUIETLY 01
@@ -344,13 +344,13 @@ static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx
  * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if
  * the name doesn't match a known algorithm.
  */
-int hash_algo_by_name(const char *name);
+uint32_t hash_algo_by_name(const char *name);
 /* Identical, except based on the format ID. */
-int hash_algo_by_id(uint32_t format_id);
+uint32_t hash_algo_by_id(uint32_t format_id);
 /* Identical, except based on the length. */
-int hash_algo_by_length(size_t len);
+uint32_t hash_algo_by_length(size_t len);
 /* Identical, except for a pointer to struct git_hash_algo. */
-static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
+static inline uint32_t hash_algo_by_ptr(const struct git_hash_algo *p)
 {
 	size_t i;
 	for (i = 0; i < GIT_HASH_NALGOS; i++) {
diff --git a/oidtree.c b/oidtree.c
index 151568f74f..324de94934 100644
--- a/oidtree.c
+++ b/oidtree.c
@@ -10,7 +10,7 @@ struct oidtree_iter_data {
 	oidtree_iter fn;
 	void *arg;
 	size_t *last_nibble_at;
-	int algo;
+	uint32_t algo;
 	uint8_t last_byte;
 };
 
diff --git a/repository.c b/repository.c
index 823f110019..34a029b1e4 100644
--- a/repository.c
+++ b/repository.c
@@ -39,7 +39,7 @@ struct repository *the_repository = &the_repo;
 static void set_default_hash_algo(struct repository *repo)
 {
 	const char *hash_name;
-	int algo;
+	uint32_t algo;
 
 	hash_name = getenv("GIT_TEST_DEFAULT_HASH_ALGO");
 	if (!hash_name)
@@ -186,12 +186,12 @@ void repo_set_gitdir(struct repository *repo,
 			   repo->gitdir, "index");
 }
 
-void repo_set_hash_algo(struct repository *repo, int hash_algo)
+void repo_set_hash_algo(struct repository *repo, uint32_t hash_algo)
 {
 	repo->hash_algo = &hash_algos[hash_algo];
 }
 
-void repo_set_compat_hash_algo(struct repository *repo, int algo)
+void repo_set_compat_hash_algo(struct repository *repo, uint32_t algo)
 {
 #ifdef WITH_RUST
 	if (hash_algo_by_ptr(repo->hash_algo) == algo)
diff --git a/repository.h b/repository.h
index 5808a5d610..c0a3543b24 100644
--- a/repository.h
+++ b/repository.h
@@ -193,8 +193,8 @@ struct set_gitdir_args {
 void repo_set_gitdir(struct repository *repo, const char *root,
 		     const struct set_gitdir_args *extra_args);
 void repo_set_worktree(struct repository *repo, const char *path);
-void repo_set_hash_algo(struct repository *repo, int algo);
-void repo_set_compat_hash_algo(struct repository *repo, int compat_algo);
+void repo_set_hash_algo(struct repository *repo, uint32_t algo);
+void repo_set_compat_hash_algo(struct repository *repo, uint32_t compat_algo);
 void repo_set_ref_storage_format(struct repository *repo,
 				 enum ref_storage_format format);
 void initialize_repository(struct repository *repo);
diff --git a/serve.c b/serve.c
index 53ecab3b42..49a6e39b1d 100644
--- a/serve.c
+++ b/serve.c
@@ -14,7 +14,7 @@
 
 static int advertise_sid = -1;
 static int advertise_object_info = -1;
-static int client_hash_algo = GIT_HASH_SHA1_LEGACY;
+static uint32_t client_hash_algo = GIT_HASH_SHA1_LEGACY;
 
 static int always_advertise(struct repository *r UNUSED,
 			    struct strbuf *value UNUSED)

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Patrick Steinhardt @ 2025-10-28  9:16 UTC
To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren

On Mon, Oct 27, 2025 at 12:43:53AM +0000, brian m. carlson wrote:
> We currently use an int for this value, but we'll define this structure
> from Rust in a future commit and we want to ensure that our data types
> are exactly identical. To make that possible, use a uint32_t for the
> hash algorithm.

An alternative would be to introduce an enum and set up bindgen so that
we can pull this enum into Rust. I'd personally favor that over using a
uint32_t as it conveys way more meaning. Have you considered this?

Patrick

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Ezekiel Newren @ 2025-10-28 18:28 UTC
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Junio C Hamano

On Tue, Oct 28, 2025 at 3:17 AM Patrick Steinhardt <ps@pks.im> wrote:
> On Mon, Oct 27, 2025 at 12:43:53AM +0000, brian m. carlson wrote:
> > We currently use an int for this value, but we'll define this structure
> > from Rust in a future commit and we want to ensure that our data types
> > are exactly identical. To make that possible, use a uint32_t for the
> > hash algorithm.
>
> An alternative would be to introduce an enum and set up bindgen so that
> we can pull this enum into Rust. I'd personally favor that over using a
> uint32_t as it conveys way more meaning. Have you considered this?

I think uint32_t is appropriate here over an enum because this value
will also exist on disk. An enum in Rust is really only safe if it
exists exclusively in memory and is untouched by C. Later in this patch
series there is a function that creates an enum from a u32. I agree
with Brian's design choice here.
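
A hedged sketch of the kind of checked u32-to-enum conversion Ezekiel
alludes to (names and constant values are illustrative, not the series'
actual code): the FFI field stays a plain u32, and the conversion to a
Rust enum happens at the boundary, where any value C may have written
can be rejected instead of invoking undefined behaviour.

    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum Algo {
        Sha1,
        Sha256,
    }

    impl Algo {
        /// Convert an untrusted u32 (e.g. read from disk or set by C)
        /// into an enum, refusing anything we don't recognize.
        fn from_raw(raw: u32) -> Option<Algo> {
            match raw {
                1 => Some(Algo::Sha1),   // illustrative constant
                2 => Some(Algo::Sha256), // illustrative constant
                _ => None,               // GIT_HASH_UNKNOWN or garbage
            }
        }
    }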

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Junio C Hamano @ 2025-10-28 19:33 UTC
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Ezekiel Newren

Patrick Steinhardt <ps@pks.im> writes:

> On Mon, Oct 27, 2025 at 12:43:53AM +0000, brian m. carlson wrote:
>> We currently use an int for this value, but we'll define this structure
>> from Rust in a future commit and we want to ensure that our data types
>> are exactly identical. To make that possible, use a uint32_t for the
>> hash algorithm.
>
> An alternative would be to introduce an enum and set up bindgen so that
> we can pull this enum into Rust. I'd personally favor that over using a
> uint32_t as it conveys way more meaning. Have you considered this?

Yeah, I do not very much appreciate change from "int" to "uint32_t"
randomly done only for things that happen to be used by both C and
Rust. "When should I use 'int' or 'unsigned' and when should I use
'uint32_t'?" becomes extremely hard to answer.

I suspect that it would be much more palatable if these functions
and struct members are to use a distinct type that is used only by
hash algorithm number (your "enum" is fine), that is typedef'ed to
be the 32-bit unsigned integer, e.g.,

	+typedef uint32_t hash_algo_type;
	-int hash_algo_by_name(const char *name)
	+hash_algo_type hash_algo_by_name(const char *name)

Yeah, I know that C does not give us type safety against mixing two
different things, both of which are typedef'ed to the same uint32_t,
but doing something like the above would still add documentation
value.

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Ezekiel Newren @ 2025-10-28 19:58 UTC
To: Junio C Hamano; +Cc: Patrick Steinhardt, brian m. carlson, git

On Tue, Oct 28, 2025 at 1:33 PM Junio C Hamano <gitster@pobox.com> wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > On Mon, Oct 27, 2025 at 12:43:53AM +0000, brian m. carlson wrote:
> >> We currently use an int for this value, but we'll define this structure
> >> from Rust in a future commit and we want to ensure that our data types
> >> are exactly identical. To make that possible, use a uint32_t for the
> >> hash algorithm.
> >
> > An alternative would be to introduce an enum and set up bindgen so that
> > we can pull this enum into Rust. I'd personally favor that over using a
> > uint32_t as it conveys way more meaning. Have you considered this?
>
> Yeah, I do not very much appreciate change from "int" to "uint32_t"
> randomly done only for things that happen to be used by both C and
> Rust. "When should I use 'int' or 'unsigned' and when should I use
> 'uint32_t'?" becomes extremely hard to answer.

I think the most appropriate time to change from C's ambiguous types to
unambiguous types is when it's going to be used for Rust FFI. uint32_t
should be used everywhere and casting to int or unsigned should be done
where that code hasn't been converted yet. This commit isn't random,
it's a deliberate effort to address code debt.

> I suspect that it would be much more palatable if these functions
> and struct members are to use a distinct type that is used only by
> hash algorithm number (your "enum" is fine), that is typedef'ed to
> be the 32-bit unsigned integer, e.g.,
>
>	+typedef uint32_t hash_algo_type;
>	-int hash_algo_by_name(const char *name)
>	+hash_algo_type hash_algo_by_name(const char *name)
>
> Yeah, I know that C does not give us type safety against mixing two
> different things, both of which are typedef'ed to the same uint32_t,
> but doing something like the above would still add documentation
> value.

I'm against passing Rust enum types over the FFI boundary since Rust is
free to add extra bytes to distinguish between types (and it's
documented by Rust as not being ABI stable). Even if something like
#[repr(C)] is used, the problem is that the enum on the Rust side will
have an implicit field, where that implicit field will need to be made
explicit on the C side, and if C sets an invalid value for that
implicit field then that will result in Rust UB. Converting Rust enum
types to C is non-trivial and has many gotchas.

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Junio C Hamano @ 2025-10-28 20:20 UTC
To: Ezekiel Newren; +Cc: Patrick Steinhardt, brian m. carlson, git

Ezekiel Newren <ezekielnewren@gmail.com> writes:

>> I suspect that it would be much more palatable if these functions
>> and struct members are to use a distinct type that is used only by
>> hash algorithm number (your "enum" is fine), that is typedef'ed to
>> be the 32-bit unsigned integer, e.g.,
>>
>>	+typedef uint32_t hash_algo_type;
>>	-int hash_algo_by_name(const char *name)
>>	+hash_algo_type hash_algo_by_name(const char *name)
>>
>> Yeah, I know that C does not give us type safety against mixing two
>> different things, both of which are typedef'ed to the same uint32_t,
>> but doing something like the above would still add documentation
>> value.
>
> I'm against passing Rust enum types over the FFI boundary since Rust
> is free to add extra bytes to distinguish between types (and it's
> documented by Rust as not being ABI stable).

It's OK for you to be against it. My mention of "enum" was enum on the
purely C-side and I didn't have Rust's enum in mind at all. As Brian
defined ObjectID on the Rust side, the type tag was done as u32, IIUC,
not Rust's enum.

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: brian m. carlson @ 2025-10-30  0:23 UTC
To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Ezekiel Newren

On 2025-10-28 at 19:33:32, Junio C Hamano wrote:
> Yeah, I do not very much appreciate change from "int" to "uint32_t"
> randomly done only for things that happen to be used by both C and
> Rust. "When should I use 'int' or 'unsigned' and when should I use
> 'uint32_t'?" becomes extremely hard to answer.

In general, the answer is that we should use `int` or `unsigned` when
you're defining a loop index or other non-structure types that are only
used from C. Otherwise, we should use one of the stdint.h or stddef.h
types ((u)int*_t, (s)size_t, etc.), since these have defined,
well-understood sizes. Also, in general, we want to use unsigned types
for things that cannot have valid negative values (such as the hash
algorithm constants that are also array indices), especially since Rust
tends not to use sentinel values (preferring `Option` instead).

Part of our problem is that being lazy and making lots of assumptions
in our codebase has led to some suboptimal consequences. Our diff code
can't handle files bigger than about 1 GiB because we use `int`, and
Windows has all sorts of size limitations because we assumed that
sizeof(long) == sizeof(size_t) == sizeof(void *). Nobody now would say,
"Gee, I think we'd like to have these arbitrary 32-bit size limits,"
and using something with a fixed size helps us think, "How big should
this data type be? Do I really want to limit this data structure to
processing only 32 bits worth of data?"

In this case, the use of a 32-bit value is fine because we already have
that for the existing type (via `int`) and it is extremely unlikely
that 4 billion cryptographic hash algorithms will ever be created, let
alone implemented in Git, so the size is not a factor.

-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Collin Funk @ 2025-10-30  1:58 UTC
To: brian m. carlson; +Cc: Junio C Hamano, Patrick Steinhardt, git, Ezekiel Newren

Hi Brian,

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On 2025-10-28 at 19:33:32, Junio C Hamano wrote:
>> Yeah, I do not very much appreciate change from "int" to "uint32_t"
>> randomly done only for things that happen to be used by both C and
>> Rust. "When should I use 'int' or 'unsigned' and when should I use
>> 'uint32_t'?" becomes extremely hard to answer.
>
> In general, the answer is that we should use `int` or `unsigned` when
> you're defining a loop index or other non-structure types that are only
> used from C. Otherwise, we should use one of the stdint.h or stddef.h
> types ((u)int*_t, (s)size_t, etc.), since these have defined,
> well-understood sizes. Also, in general, we want to use unsigned types
> for things that cannot have valid negative values (such as the hash
> algorithm constants that are also array indices), especially since Rust
> tends not to use sentinel values (preferring `Option` instead).

I don't necessarily disagree with your point, just want to reiterate a
point I touched on in another thread [1]. In some cases it is valuable
to use signed integers even if a valid value will never be negative.
This is because signed integer overflow can be easily caught with
-fsanitize=undefined. An unsigned integer wrapping around is perfectly
defined, but may lead to strange bugs in your program.

> Part of our problem is that being lazy and making lots of assumptions in
> our codebase has led to some suboptimal consequences. Our diff code
> can't handle files bigger than about 1 GiB because we use `int` and
> Windows has all sorts of size limitations because we assumed that
> sizeof(long) == sizeof(size_t) == sizeof(void *). Nobody now would say,
> "Gee, I think we'd like to have these arbitrary 32-bit size limits," and
> using something with a fixed size helps us think, "How big should this
> data type be? Do I really want to limit this data structure to
> processing only 32 bits worth of data?"
>
> In this case, the use of a 32-bit value is fine because we already have
> that for the existing type (via `int`) and it is extremely unlikely that
> 4 billion cryptographic hash algorithms will ever be created, let alone
> implemented in Git, so the size is not a factor.

I guess intmax_t and uintmax_t are probably not usable with Rust, since
they are not fixed width?

Collin

[1] https://public-inbox.org/git/87jz16dux5.fsf@gmail.com/

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: brian m. carlson @ 2025-11-03  1:30 UTC
To: Collin Funk; +Cc: Junio C Hamano, Patrick Steinhardt, git, Ezekiel Newren

On 2025-10-30 at 01:58:52, Collin Funk wrote:
> I guess intmax_t and uintmax_t are probably not usable with Rust, since
> they are not fixed width?

They are effectively 64 bit everywhere, so `i64` or `u64` is
appropriate. These types are not actually the largest possible integers
anymore, since they were originally defined as 64 bit and implementers
refused to change them once 128-bit values were supported, because that
would break ABI. With gcc or clang, you can do this to see:

    % clang -E -dM - </dev/null | grep INTMAX_TYPE
    #define __INTMAX_TYPE__ long int
    #define __UINTMAX_TYPE__ long unsigned int

Rust also has `i128` and `u128`, which are part of the ABI and are also
used for things like `std::time::Duration::as_nanos`. Rust claims that
it is ABI-compatible with C's `__int128` where that exists, but it does
not exist in all C compilers and on all architectures. Compatibility
with C's `_BitInt(128)` is explicitly disclaimed.

-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: brian m. carlson @ 2025-10-29  0:33 UTC
To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Ezekiel Newren

On 2025-10-28 at 09:16:57, Patrick Steinhardt wrote:
> An alternative would be to introduce an enum and set up bindgen so that
> we can pull this enum into Rust. I'd personally favor that over using a
> uint32_t as it conveys way more meaning. Have you considered this?

That would lead to problems because we zero-initialize some object IDs
(and you see later in the series what problems that causes) and that
will absolutely not work in Rust, since setting an enum to an invalid
value is undefined behaviour.

-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

* Re: [PATCH 03/14] hash: use uint32_t for object_id algorithm

From: Patrick Steinhardt @ 2025-10-29  9:07 UTC
To: brian m. carlson, git, Junio C Hamano, Ezekiel Newren

On Wed, Oct 29, 2025 at 12:33:30AM +0000, brian m. carlson wrote:
> On 2025-10-28 at 09:16:57, Patrick Steinhardt wrote:
> > An alternative would be to introduce an enum and set up bindgen so that
> > we can pull this enum into Rust. I'd personally favor that over using a
> > uint32_t as it conveys way more meaning. Have you considered this?
>
> That would lead to problems because we zero-initialize some object IDs
> (and you see later in the series what problems that causes) and that
> will absolutely not work in Rust, since setting an enum to an invalid
> value is undefined behaviour.

We could of course try and represent the uninitialized state with a
third enum state. But it would probably make things awfully unergonomic
all over the place :/

Patrick
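
A hedged sketch of the "third state" Patrick mentions (illustrative
only, not code from the series; the thread's concern is that every use
site would then have to handle the extra variant):

    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum Algo {
        Unknown, // zero-initialized object IDs would land here
        Sha1,
        Sha256,
    }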

* [PATCH 04/14] rust: add a ObjectID struct

From: brian m. carlson @ 2025-10-27  0:43 UTC
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

We'd like to be able to write some Rust code that can work with object
IDs. Add a structure here that's identical to struct object_id in C.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile        |  1 +
 src/hash.rs     | 21 +++++++++++++++++++++
 src/lib.rs      |  1 +
 src/meson.build |  1 +
 4 files changed, 24 insertions(+)
 create mode 100644 src/hash.rs

diff --git a/Makefile b/Makefile
index 1919d35bf3..7e5a735ca6 100644
--- a/Makefile
+++ b/Makefile
@@ -1521,6 +1521,7 @@ CLAR_TEST_OBJS += $(UNIT_TEST_DIR)/unit-test.o
 
 UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o
 
+RUST_SOURCES += src/hash.rs
 RUST_SOURCES += src/lib.rs
 RUST_SOURCES += src/varint.rs
 
diff --git a/src/hash.rs b/src/hash.rs
new file mode 100644
index 0000000000..0219391820
--- /dev/null
+++ b/src/hash.rs
@@ -0,0 +1,21 @@
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation: version 2 of the License, dated June 1991.
+//
+// This program is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+// GNU General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License along
+// with this program; if not, see <https://www.gnu.org/licenses/>.
+
+pub const GIT_MAX_RAWSZ: usize = 32;
+
+/// A binary object ID.
+#[repr(C)]
+#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)]
+pub struct ObjectID {
+    pub hash: [u8; GIT_MAX_RAWSZ],
+    pub algo: u32,
+}
diff --git a/src/lib.rs b/src/lib.rs
index 9da70d8b57..cf7c962509 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -1 +1,2 @@
+pub mod hash;
 pub mod varint;
diff --git a/src/meson.build b/src/meson.build
index 25b9ad5a14..c77041a3fa 100644
--- a/src/meson.build
+++ b/src/meson.build
@@ -1,4 +1,5 @@
 libgit_rs_sources = [
+  'hash.rs',
   'lib.rs',
   'varint.rs',
 ]

* Re: [PATCH 04/14] rust: add a ObjectID struct

From: Patrick Steinhardt @ 2025-10-28  9:17 UTC
To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren

On Mon, Oct 27, 2025 at 12:43:54AM +0000, brian m. carlson wrote:
> diff --git a/src/hash.rs b/src/hash.rs
> new file mode 100644
> index 0000000000..0219391820
> --- /dev/null
> +++ b/src/hash.rs
> @@ -0,0 +1,21 @@
> +// This program is free software; you can redistribute it and/or modify
> +// it under the terms of the GNU General Public License as published by
> +// the Free Software Foundation: version 2 of the License, dated June 1991.
> +//
> +// This program is distributed in the hope that it will be useful,
> +// but WITHOUT ANY WARRANTY; without even the implied warranty of
> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +// GNU General Public License for more details.
> +//
> +// You should have received a copy of the GNU General Public License along
> +// with this program; if not, see <https://www.gnu.org/licenses/>.

We typically don't have these headers for our C code, so why have it
over here?

> +pub const GIT_MAX_RAWSZ: usize = 32;
> +
> +/// A binary object ID.
> +#[repr(C)]
> +#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)]
> +pub struct ObjectID {
> +    pub hash: [u8; GIT_MAX_RAWSZ],
> +    pub algo: u32,
> +}

An alternative to represent this type would be to use an enum:

    pub enum ObjectID {
        SHA1([u8; GIT_SHA1_RAWSZ]),
        SHA256([u8; GIT_SHA256_RAWSZ]),
    }

That would give us some type safety going forward, but it might be
harder to work with for us?

Patrick

* Re: [PATCH 04/14] rust: add a ObjectID struct

From: Ezekiel Newren @ 2025-10-28 19:07 UTC
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Junio C Hamano

On Tue, Oct 28, 2025 at 3:17 AM Patrick Steinhardt <ps@pks.im> wrote:
> On Mon, Oct 27, 2025 at 12:43:54AM +0000, brian m. carlson wrote:
> > diff --git a/src/hash.rs b/src/hash.rs
> > new file mode 100644
> > index 0000000000..0219391820
> > --- /dev/null
> > +++ b/src/hash.rs
> > @@ -0,0 +1,21 @@
> > +// This program is free software; you can redistribute it and/or modify
> > +// it under the terms of the GNU General Public License as published by
> > +// the Free Software Foundation: version 2 of the License, dated June 1991.
> > +//
> > +// This program is distributed in the hope that it will be useful,
> > +// but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > +// GNU General Public License for more details.
> > +//
> > +// You should have received a copy of the GNU General Public License along
> > +// with this program; if not, see <https://www.gnu.org/licenses/>.
>
> We typically don't have these headers for our C code, so why have it
> over here?

I'm wondering this too even though you gave a reason in your cover
letter. I'm against putting licenses in each source file, and don't see
how it's better than having a separate license file.

> > +pub const GIT_MAX_RAWSZ: usize = 32;
> > +
> > +/// A binary object ID.
> > +#[repr(C)]
> > +#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)]
> > +pub struct ObjectID {
> > +    pub hash: [u8; GIT_MAX_RAWSZ],
> > +    pub algo: u32,
> > +}
>
> An alternative to represent this type would be to use an enum:
>
>     pub enum ObjectID {
>         SHA1([u8; GIT_SHA1_RAWSZ]),
>         SHA256([u8; GIT_SHA256_RAWSZ]),
>     }
>
> That would give us some type safety going forward, but it might be
> harder to work with for us?

This would be fine if it was used exclusively in Rust, but since this
is a type that has to cross the FFI boundary it should be defined as a
struct in C and Rust. If you run size_of::<ObjectID>() you'll get 33
(but it could be something else). Without #[repr(C, u8)] the Rust
compiler is free to choose how to define the discriminant (its length
and values) to distinguish the 2 types. If you do use #[repr(C, u8)]
then you have the possible problem of C setting an invalid discriminant
value, which would result in undefined behavior. It also doesn't make
sense as an FFI type since a Rust enum is closer to a C union than a C
enum. The point here is that Brian is matching the existing C struct
with an equivalent Rust struct.
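
To make the layout point concrete (a sketch, not part of the series):
with #[repr(C)] the struct's size and field order are fixed and
C-compatible, and the expected size can even be checked at compile
time.

    pub const GIT_MAX_RAWSZ: usize = 32;

    #[repr(C)]
    pub struct ObjectID {
        pub hash: [u8; GIT_MAX_RAWSZ],
        pub algo: u32,
    }

    // 32 hash bytes plus a 4-byte algo field, 4-byte aligned: 36 bytes,
    // matching sizeof(struct object_id) on the C side.
    const _: () = assert!(std::mem::size_of::<ObjectID>() == 36);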

* Re: [PATCH 04/14] rust: add a ObjectID struct

From: brian m. carlson @ 2025-10-29  0:42 UTC
To: Ezekiel Newren; +Cc: Patrick Steinhardt, git, Junio C Hamano

On 2025-10-28 at 19:07:36, Ezekiel Newren wrote:
> I'm wondering this too even though you gave a reason in your cover
> letter. I'm against putting licenses in each source file, and don't
> see how it's better than having a separate license file.

As I said, the DCO says the "open source license indicated in the
file". I also see lots of open source code being sucked into LLMs these
days as training data and I want the LLM to learn that Git's code is
GPLv2, so when it produces output, it does so with the GPLv2 header in
the file. We already have similar notices in the reftable code, so
there's plenty of precedent for it.

> This would be fine if it was used exclusively in Rust, but since this
> is a type that has to cross the FFI boundary it should be defined as a
> struct in C and Rust. If you run size_of::<ObjectID>() you'll get 33
> (but it could be something else). Without #[repr(C, u8)] the Rust
> compiler is free to choose how to define the discriminant (its length
> and values) to distinguish the 2 types. If you do use #[repr(C, u8)]
> then you have the possible problem of C setting an invalid
> discriminant value, which would result in undefined behavior. It also
> doesn't make sense as an FFI type since a Rust enum is closer to a C
> union than a C enum. The point here is that Brian is matching the
> existing C struct with an equivalent Rust struct.

Exactly.

-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

* Re: [PATCH 04/14] rust: add a ObjectID struct

From: Junio C Hamano @ 2025-10-28 19:40 UTC
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Ezekiel Newren

Patrick Steinhardt <ps@pks.im> writes:

> On Mon, Oct 27, 2025 at 12:43:54AM +0000, brian m. carlson wrote:
>> diff --git a/src/hash.rs b/src/hash.rs
>> new file mode 100644
>> index 0000000000..0219391820
>> --- /dev/null
>> +++ b/src/hash.rs
>> @@ -0,0 +1,21 @@
>> +// This program is free software; you can redistribute it and/or modify
>> +// it under the terms of the GNU General Public License as published by
>> +// the Free Software Foundation: version 2 of the License, dated June 1991.
>> +//
>> +// This program is distributed in the hope that it will be useful,
>> +// but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> +// GNU General Public License for more details.
>> +//
>> +// You should have received a copy of the GNU General Public License along
>> +// with this program; if not, see <https://www.gnu.org/licenses/>.
>
> We typically don't have these headers for our C code, so why have it
> over here?

Yeah, another thing that puzzles me is if src/ is a good name for the
directory in the longer run (unless we plan to rewrite everything in
Rust, that is) for housing our source code written in Rust (I am
assuming that *.c files are unwelcome in that directory). But it may be
a separate topic, perhaps?

>> +pub const GIT_MAX_RAWSZ: usize = 32;
>> +
>> +/// A binary object ID.
>> +#[repr(C)]
>> +#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)]
>> +pub struct ObjectID {
>> +    pub hash: [u8; GIT_MAX_RAWSZ],
>> +    pub algo: u32,
>> +}
>
> An alternative to represent this type would be to use an enum:
>
>     pub enum ObjectID {
>         SHA1([u8; GIT_SHA1_RAWSZ]),
>         SHA256([u8; GIT_SHA256_RAWSZ]),
>     }
>
> That would give us some type safety going forward, but it might be
> harder to work with for us?

Can the latter be made to interoperate with the C side well, with the
same memory layout? Perhaps there may be a way, but the way written in
the patch looks more obviously identical to what we have on the C side,
so...

* Re: [PATCH 04/14] rust: add a ObjectID struct

From: brian m. carlson @ 2025-10-29  0:47 UTC
To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Ezekiel Newren

On 2025-10-28 at 19:40:39, Junio C Hamano wrote:
> Yeah, another thing that puzzles me is if src/ is a good name for
> the directory in the longer run (unless we plan to rewrite
> everything in Rust, that is) for housing our source code written in
> Rust (I am assuming that *.c files are unwelcome in that directory).
> But it may be a separate topic, perhaps?

That's a standard location for Rust files. The root of the repository
has `Cargo.toml` and `Cargo.lock`, source files go in `src`, and output
goes in `target`. So there's not much of an option, really.

The hierarchy of the source files also affects import locations. So
`src/hash.rs` is the `crate::hash` module and `src/foo/bar/baz.rs` is
`crate::foo::bar::baz`.

There's no reason that `*.c` files cannot live in `src`, but Cargo pays
no attention to those (unless they're compiled with the `cc` crate as
part of `build.rs`). We had a project at work that moved from C to Rust
incrementally and we moved all the C files into `src`, which was not a
problem.

-- 
brian m. carlson (they/them)
Toronto, Ontario, CA
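
A sketch of the module mapping brian describes, mirroring what
src/lib.rs in this series already contains:

    // src/lib.rs is the crate root; each `pub mod` line maps to a file
    // under src/, so the module path follows the directory hierarchy.
    pub mod hash;   // src/hash.rs   -> crate::hash
    pub mod varint; // src/varint.rs -> crate::varint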
* Re: [PATCH 04/14] rust: add a ObjectID struct 2025-10-28 9:17 ` Patrick Steinhardt 2025-10-28 19:07 ` Ezekiel Newren 2025-10-28 19:40 ` Junio C Hamano @ 2025-10-29 0:36 ` brian m. carlson 2025-10-29 9:08 ` Patrick Steinhardt 2 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-29 0:36 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 792 bytes --] On 2025-10-28 at 09:17:03, Patrick Steinhardt wrote: > We typically don't have these headers for our C code, so why have it > over here? This is explained in the cover letter. > An alternative to represent this type would be to use an enum: > > pub enum ObjectID { > SHA1([u8; GIT_SHA1_RAWSZ]), > SHA256([u8; GIT_SHA256_RAWSZ]), > } > > That would give us some type safety going forward, but it might be > harder to work with for us? I agree that would be a nicer end state, but that type can't be cast from C, which is something we do later in the series. The goal is to have a type that is suitable for FFI between C and Rust, and we will be able to switch once we have no more C code using this type. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
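To illustrate the casting point: with `#[repr(C)]`, Rust pins down a C-compatible layout, which the plain enum does not guarantee. A sketch, assuming the C `struct object_id` carries a 32-byte hash array followed by a 32-bit algorithm field, as the Rust definition in the patch implies:

    // With #[repr(C)], fields are laid out in declaration order with C
    // alignment rules, so the C struct
    //
    //     struct object_id {
    //         unsigned char hash[GIT_MAX_RAWSZ];
    //         uint32_t algo; /* assuming the series' 32-bit field */
    //     };
    //
    // and the Rust struct below can share pointers freely.
    #[repr(C)]
    pub struct ObjectID {
        pub hash: [u8; 32], // GIT_MAX_RAWSZ
        pub algo: u32,
    }
    // The enum variant carries a compiler-chosen discriminant and layout,
    // so C code could not cast a `struct object_id *` to it; a conversion
    // would be needed at every boundary crossing.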
* Re: [PATCH 04/14] rust: add a ObjectID struct 2025-10-29 0:36 ` brian m. carlson @ 2025-10-29 9:08 ` Patrick Steinhardt 2025-10-30 0:32 ` brian m. carlson 0 siblings, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-29 9:08 UTC (permalink / raw) To: brian m. carlson, git, Junio C Hamano, Ezekiel Newren On Wed, Oct 29, 2025 at 12:36:52AM +0000, brian m. carlson wrote: > On 2025-10-28 at 09:17:03, Patrick Steinhardt wrote: > > We typically don't have these headers for our C code, so why have it > > over here? > > This is explained in the cover letter. > > > An alternative to represent this type would be to use an enum: > > > > pub enum ObjectID { > > SHA1([u8; GIT_SHA1_RAWSZ]), > > SHA256([u8; GIT_SHA256_RAWSZ]), > > } > > > > That would give us some type safety going forward, but it might be > > harder to work with for us? > > I agree that would be a nicer end state, but that type can't be cast from C, > which is something we do later in the series. The goal is to have a type that is > suitable for FFI between C and Rust, and we will be able to switch once > we have no more C code using this type. Fair. I'm mostly asking all of these questions because this is our first Rust code in Git that is a bit more involved. So it's likely that this code will set a precedent for what future code will look like, and ideally I'd like us to end up with idiomatic Rust code. With the FFI code it's of course going to be a mixed bag, as we are somewhat bound by the C interfaces. But in the best case I'd imagine that we have low-level FFI primitives that bridge the gap between C and Rust, and then we build a higher-level interface on top of that which allows us to use it in an idiomatic fashion. I guess all of this will require a lot of iteration anyway as we gain more familiarity with Rust in our codebase. And things don't have to be perfect on the first try *shrug* Thanks! Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
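As a rough sketch of the layering described above, using the C lookup `hash_algo_by_name` declared in the series' hash.h and assuming GIT_HASH_UNKNOWN is 0: a thin extern declaration at the bottom, and a safe, idiomatic entry point on top.

    use std::ffi::CString;
    use std::os::raw::c_char;

    // Low-level FFI primitive: mirrors the C signature exactly.
    extern "C" {
        fn hash_algo_by_name(name: *const c_char) -> u32;
    }

    // Higher-level interface: owns the C-string conversion and turns the
    // sentinel return value into an Option, so no unsafety leaks out.
    pub fn algo_by_name(name: &str) -> Option<HashAlgorithm> {
        let cname = CString::new(name).ok()?;
        HashAlgorithm::from_u32(unsafe { hash_algo_by_name(cname.as_ptr()) })
    }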
* Re: [PATCH 04/14] rust: add a ObjectID struct 2025-10-29 9:08 ` Patrick Steinhardt @ 2025-10-30 0:32 ` brian m. carlson 0 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-30 0:32 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 1932 bytes --] On 2025-10-29 at 09:08:05, Patrick Steinhardt wrote: > I'm mostly asking all of these questions because this is our first Rust > code in Git that is a bit more involved. So it's likely that this code > will set a precedent for what future code will look like, and ideally I'd > like us to end up with idiomatic Rust code. In general, I'd like that, too, and that's a fair question. > With the FFI code it's of course going to be a mixed bag, as we are > somewhat bound by the C interfaces. But in the best case I'd imagine > that we have low-level FFI primitives that bridge the gap between C and > Rust, and then we build a higher-level interface on top of that which > allows us to use it in an idiomatic fashion. The reason I've made the decision to minimize conversions here is that the object ID lookups are in a hot path in `index-pack` and various protocol code. If we clone the Linux repository (in SHA-1) and want to convert it to SHA-256 as part of that clone, we may need to convert every object and then deltify it to write the SHA-256 pack. This is never going to really scream in terms of performance, as you might imagine, but it can be better or worse, and I've tried to make it a little better. Similarly, if we have 500,000 refs on the remote[0], each of those have/want pairs has to be potentially converted and we want people to feel positively about our performance. I will send a patch in a future series that will make this a little more idiomatic on the Rust side as well. > I guess all of this will require a lot of iteration anyway as we gain > more familiarity with Rust in our codebase. And things don't have to be > perfect on the first try *shrug* Yeah, we'll come up with some standards and design guidance as things go along. [0] Some major users of Git do have refs on this order of magnitude. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (3 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 04/14] rust: add a ObjectID struct brian m. carlson @ 2025-10-27 0:43 ` brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 20:00 ` Junio C Hamano 2025-10-27 0:43 ` [PATCH 06/14] hash: add a function to look up hash algo structs brian m. carlson ` (12 subsequent siblings) 17 siblings, 2 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:43 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren This works very similarly to the existing one in C except that it doesn't provide any functionality to hash an object. We don't need that right now, but the use of those function pointers does make it substantially more difficult to write a bit-for-bit identical structure across the C/Rust interface, so omit them for now. Instead of the more customary "&self", use "self", because the former is the size of a pointer and the latter is the size of an integer on most systems. Don't define an unknown value, but use an Option for that instead. Update the object ID structure to allow slicing the data appropriately for the algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- src/hash.rs | 142 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 142 insertions(+) diff --git a/src/hash.rs b/src/hash.rs index 0219391820..1b9f07489e 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -19,3 +19,145 @@ pub struct ObjectID { pub hash: [u8; GIT_MAX_RAWSZ], pub algo: u32, } + +#[allow(dead_code)] +impl ObjectID { + pub fn as_slice(&self) -> &[u8] { + match HashAlgorithm::from_u32(self.algo) { + Some(algo) => &self.hash[0..algo.raw_len()], + None => &self.hash, + } + } + + pub fn as_mut_slice(&mut self) -> &mut [u8] { + match HashAlgorithm::from_u32(self.algo) { + Some(algo) => &mut self.hash[0..algo.raw_len()], + None => &mut self.hash, + } + } +} + +/// A hash algorithm. +#[repr(C)] +#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] +pub enum HashAlgorithm { + SHA1 = 1, + SHA256 = 2, +} + +#[allow(dead_code)] +impl HashAlgorithm { + const SHA1_NULL_OID: ObjectID = ObjectID { + hash: [0u8; 32], + algo: Self::SHA1 as u32, + }; + const SHA256_NULL_OID: ObjectID = ObjectID { + hash: [0u8; 32], + algo: Self::SHA256 as u32, + }; + + const SHA1_EMPTY_TREE: ObjectID = ObjectID { + hash: *b"\x4b\x82\x5d\xc6\x42\xcb\x6e\xb9\xa0\x60\xe5\x4b\xf8\xd6\x92\x88\xfb\xee\x49\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + algo: Self::SHA1 as u32, + }; + const SHA256_EMPTY_TREE: ObjectID = ObjectID { + hash: *b"\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1\x04\xd4\x5d\x8d\x85\xef\xa9\xb0\x57\xb5\x3b\x14\xb4\xb9\xb9\x39\xdd\x74\xde\xcc\x53\x21", + algo: Self::SHA256 as u32, + }; + + const SHA1_EMPTY_BLOB: ObjectID = ObjectID { + hash: *b"\xe6\x9d\xe2\x9b\xb2\xd1\xd6\x43\x4b\x8b\x29\xae\x77\x5a\xd8\xc2\xe4\x8c\x53\x91\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + algo: Self::SHA1 as u32, + }; + const SHA256_EMPTY_BLOB: ObjectID = ObjectID { + hash: *b"\x47\x3a\x0f\x4c\x3b\xe8\xa9\x36\x81\xa2\x67\xe3\xb1\xe9\xa7\xdc\xda\x11\x85\x43\x6f\xe1\x41\xf7\x74\x91\x20\xa3\x03\x72\x18\x13", + algo: Self::SHA256 as u32, + }; + + /// Return a hash algorithm based on the internal integer ID used by Git. + /// + /// Returns `None` if the algorithm doesn't indicate a valid algorithm.
+ pub const fn from_u32(algo: u32) -> Option<HashAlgorithm> { + match algo { + 1 => Some(HashAlgorithm::SHA1), + 2 => Some(HashAlgorithm::SHA256), + _ => None, + } + } + + /// Return a hash algorithm based on the format ID used by Git in binary formats. + /// + /// Returns `None` if the algorithm doesn't indicate a valid algorithm. + pub const fn from_format_id(algo: u32) -> Option<HashAlgorithm> { + match algo { + 0x73686131 => Some(HashAlgorithm::SHA1), + 0x73323536 => Some(HashAlgorithm::SHA256), + _ => None, + } + } + + /// The name of this hash algorithm as a string suitable for the configuration file. + pub const fn name(self) -> &'static str { + match self { + HashAlgorithm::SHA1 => "sha1", + HashAlgorithm::SHA256 => "sha256", + } + } + + /// The format ID of this algorithm for binary formats. + /// + /// Note that when writing this to a data format, it should be written in big-endian format + /// explicitly. + pub const fn format_id(self) -> u32 { + match self { + HashAlgorithm::SHA1 => 0x73686131, + HashAlgorithm::SHA256 => 0x73323536, + } + } + + /// The length of binary object IDs in this algorithm in bytes. + pub const fn raw_len(self) -> usize { + match self { + HashAlgorithm::SHA1 => 20, + HashAlgorithm::SHA256 => 32, + } + } + + /// The length of object IDs in this algorithm in hexadecimal characters. + pub const fn hex_len(self) -> usize { + self.raw_len() * 2 + } + + /// The number of bytes which is processed by one iteration of this algorithm's compression + /// function. + pub const fn block_size(self) -> usize { + match self { + HashAlgorithm::SHA1 => 64, + HashAlgorithm::SHA256 => 64, + } + } + + /// The object ID representing the empty blob. + pub const fn empty_blob(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_EMPTY_BLOB, + HashAlgorithm::SHA256 => &Self::SHA256_EMPTY_BLOB, + } + } + + /// The object ID representing the empty tree. + pub const fn empty_tree(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_EMPTY_TREE, + HashAlgorithm::SHA256 => &Self::SHA256_EMPTY_TREE, + } + } + + /// The object ID which is all zeros. + pub const fn null_oid(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_NULL_OID, + HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, + } + } +} ^ permalink raw reply related [flat|nested] 101+ messages in thread
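A short usage sketch of the abstraction above, assuming the patch's HashAlgorithm is in scope; all of the values come from the patch itself:

    #[test]
    fn hash_algorithm_basics() {
        let algo = HashAlgorithm::from_u32(2).expect("2 is SHA-256");
        assert_eq!(algo.name(), "sha256");
        assert_eq!(algo.raw_len(), 32);
        assert_eq!(algo.hex_len(), 64);
        assert_eq!(algo.format_id(), 0x73323536);
        assert_eq!(HashAlgorithm::from_format_id(0x73686131),
                   Some(HashAlgorithm::SHA1));
    }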
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-27 0:43 ` [PATCH 05/14] rust: add a hash algorithm abstraction brian m. carlson @ 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 17:09 ` Ezekiel Newren 0 siblings, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-28 9:18 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren On Mon, Oct 27, 2025 at 12:43:55AM +0000, brian m. carlson wrote: > diff --git a/src/hash.rs b/src/hash.rs > index 0219391820..1b9f07489e 100644 > --- a/src/hash.rs > +++ b/src/hash.rs > @@ -19,3 +19,145 @@ pub struct ObjectID { > pub hash: [u8; GIT_MAX_RAWSZ], > pub algo: u32, > } > + > +#[allow(dead_code)] > +impl ObjectID { > + pub fn as_slice(&self) -> &[u8] { > + match HashAlgorithm::from_u32(self.algo) { > + Some(algo) => &self.hash[0..algo.raw_len()], > + None => &self.hash, > + } > + } > + > + pub fn as_mut_slice(&mut self) -> &mut [u8] { > + match HashAlgorithm::from_u32(self.algo) { > + Some(algo) => &mut self.hash[0..algo.raw_len()], > + None => &mut self.hash, > + } > + } > +} > + > +/// A hash algorithm. > +#[repr(C)] > +#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] > +pub enum HashAlgorithm { > + SHA1 = 1, > + SHA256 = 2, > +} > + Seeing all the `match` statements: we could alternatively implement this as a trait. This would have the added benefit that we cannot miss updating any of the functions if we ever were to add another hash function. Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
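For reference, a rough sketch of the trait-based shape suggested above (all names hypothetical): each algorithm becomes its own type, and adding a new algorithm forces every method to be implemented rather than every `match` to be extended.

    pub trait Algorithm {
        const RAW_LEN: usize;
        fn name(&self) -> &'static str;
        fn format_id(&self) -> u32;
    }

    pub struct Sha1;
    impl Algorithm for Sha1 {
        const RAW_LEN: usize = 20;
        fn name(&self) -> &'static str { "sha1" }
        fn format_id(&self) -> u32 { 0x73686131 }
    }

    pub struct Sha256;
    impl Algorithm for Sha256 {
        const RAW_LEN: usize = 32;
        fn name(&self) -> &'static str { "sha256" }
        fn format_id(&self) -> u32 { 0x73323536 }
    }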
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-28 17:09 ` Ezekiel Newren 0 siblings, 0 replies; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 17:09 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: brian m. carlson, git, Junio C Hamano On Tue, Oct 28, 2025 at 3:18 AM Patrick Steinhardt <ps@pks.im> wrote: > > On Mon, Oct 27, 2025 at 12:43:55AM +0000, brian m. carlson wrote: > > diff --git a/src/hash.rs b/src/hash.rs > > index 0219391820..1b9f07489e 100644 > > --- a/src/hash.rs > > +++ b/src/hash.rs > > @@ -19,3 +19,145 @@ pub struct ObjectID { > > pub hash: [u8; GIT_MAX_RAWSZ], > > pub algo: u32, > > } > > + > > +#[allow(dead_code)] > > +impl ObjectID { > > + pub fn as_slice(&self) -> &[u8] { > > + match HashAlgorithm::from_u32(self.algo) { > > + Some(algo) => &self.hash[0..algo.raw_len()], > > + None => &self.hash, > > + } > > + } > > + > > + pub fn as_mut_slice(&mut self) -> &mut [u8] { > > + match HashAlgorithm::from_u32(self.algo) { > > + Some(algo) => &mut self.hash[0..algo.raw_len()], > > + None => &mut self.hash, > > + } > > + } > > +} > > + > > +/// A hash algorithm. > > +#[repr(C)] > > +#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] > > +pub enum HashAlgorithm { > > + SHA1 = 1, > > + SHA256 = 2, > > +} > > + > > Seeing all the `match` statements: we could alternatively implement this > as a trait. This would have the added benefit that we cannot miss > updating any of the functions if we ever were to add another hash > function. match is stricter than switch. If another enum variant is added, the current code will not compile. While I do like the idea of using traits, the problem is that the hash algorithm used needs to be known on disk. We can still use traits, but in conjunction with this enum. The part where we need to be careful is HashAlgorithm::from_u32() because if _3_ ever becomes valid then this code (currently) will say it's not. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-27 0:43 ` [PATCH 05/14] rust: add a hash algorithm abstraction brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-28 20:00 ` Junio C Hamano 2025-10-28 20:03 ` Ezekiel Newren 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-28 20:00 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > +#[allow(dead_code)] > +impl ObjectID { > + pub fn as_slice(&self) -> &[u8] { > + match HashAlgorithm::from_u32(self.algo) { > + Some(algo) => &self.hash[0..algo.raw_len()], > + None => &self.hash, > + } > + } > + > + pub fn as_mut_slice(&mut self) -> &mut [u8] { > + match HashAlgorithm::from_u32(self.algo) { > + Some(algo) => &mut self.hash[0..algo.raw_len()], > + None => &mut self.hash, > + } > + } > +} These cases for "None" surprised me a bit; I would have expected us to error out when given an algorithm we do not recognise. > + /// Return a hash algorithm based on the internal integer ID used by Git. > + /// > + /// Returns `None` if the algorithm doesn't indicate a valid algorithm. > + pub const fn from_u32(algo: u32) -> Option<HashAlgorithm> { > + match algo { > + 1 => Some(HashAlgorithm::SHA1), > + 2 => Some(HashAlgorithm::SHA256), > + _ => None, > + } > + } > + > + /// Return a hash algorithm based on the format ID used by Git in binary formats. > + /// > + /// Returns `None` if the algorithm doesn't indicate a valid algorithm. > + pub const fn from_format_id(algo: u32) -> Option<HashAlgorithm> { > + match algo { > + 0x73686131 => Some(HashAlgorithm::SHA1), > + 0x73323536 => Some(HashAlgorithm::SHA256), > + _ => None, > + } > + } > + /// The number of bytes which is processed by one iteration of this algorithm's compression > + /// function. > + pub const fn block_size(self) -> usize { > + match self { > + HashAlgorithm::SHA1 => 64, > + HashAlgorithm::SHA256 => 64, > + } > + } What we see in this patch seems to be a fairly complete rewrite of what we have in <hash.h>. I totally forgot that we had this "block size" there, which is only used in receive-pack.c when we compute the push certificate. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-28 20:00 ` Junio C Hamano @ 2025-10-28 20:03 ` Ezekiel Newren 2025-10-29 13:27 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 20:03 UTC (permalink / raw) To: Junio C Hamano; +Cc: brian m. carlson, git, Patrick Steinhardt On Tue, Oct 28, 2025 at 2:00 PM Junio C Hamano <gitster@pobox.com> wrote: > > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > +#[allow(dead_code)] > > +impl ObjectID { > > + pub fn as_slice(&self) -> &[u8] { > > + match HashAlgorithm::from_u32(self.algo) { > > + Some(algo) => &self.hash[0..algo.raw_len()], > > + None => &self.hash, > > + } > > + } > > + > > + pub fn as_mut_slice(&mut self) -> &mut [u8] { > > + match HashAlgorithm::from_u32(self.algo) { > > + Some(algo) => &mut self.hash[0..algo.raw_len()], > > + None => &mut self.hash, > > + } > > + } > > +} > > These cases for "None" surprised me a bit; I would have expected us > to error out when given an algorithm we do not recognise. I think _Result_ would be more appropriate here. ^ permalink raw reply [flat|nested] 101+ messages in thread
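A small sketch of the Result-returning variant suggested above, assuming the patch's ObjectID and HashAlgorithm are in scope and using a hypothetical error type for the unrecognized-algorithm case:

    // Hypothetical error type, purely for illustration.
    #[derive(Debug)]
    pub struct UnknownAlgorithm(pub u32);

    impl ObjectID {
        pub fn try_as_slice(&self) -> Result<&[u8], UnknownAlgorithm> {
            match HashAlgorithm::from_u32(self.algo) {
                Some(algo) => Ok(&self.hash[0..algo.raw_len()]),
                None => Err(UnknownAlgorithm(self.algo)),
            }
        }
    }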
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-28 20:03 ` Ezekiel Newren @ 2025-10-29 13:27 ` Junio C Hamano 2025-10-29 14:32 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 13:27 UTC (permalink / raw) To: Ezekiel Newren; +Cc: brian m. carlson, git, Patrick Steinhardt Ezekiel Newren <ezekielnewren@gmail.com> writes: >> > +impl ObjectID { >> > + pub fn as_slice(&self) -> &[u8] { >> > + match HashAlgorithm::from_u32(self.algo) { >> > + Some(algo) => &self.hash[0..algo.raw_len()], >> > + None => &self.hash, >> > + } >> > + } >> > + >> > + pub fn as_mut_slice(&mut self) -> &mut [u8] { >> > + match HashAlgorithm::from_u32(self.algo) { >> > + Some(algo) => &mut self.hash[0..algo.raw_len()], >> > + None => &mut self.hash, >> > + } >> > + } >> > +} >> >> These cases for "None" surprised me a bit; I would have expected us >> to error out when given an algorithm we do not recognise. > > I think _Result_ would be more appropriate here. Perhaps. But the Option/Result was not what I was surprised about. When algo is available, we gave back a slice that is properly sized, but when algo is not, I would have expected it to say "nope", instead of yielding the full area of memory available. That was the part I was surprised about. Perhaps the as_mut_slice() side is justifiable (an uninitialized instance of ObjectID is filled by getting the full self.hash and filling it, plus filling the algo), but the same explanation would not apply on the read-only side. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 05/14] rust: add a hash algorithm abstraction 2025-10-29 13:27 ` Junio C Hamano @ 2025-10-29 14:32 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 14:32 UTC (permalink / raw) To: Ezekiel Newren; +Cc: brian m. carlson, git, Patrick Steinhardt Junio C Hamano <gitster@pobox.com> writes: >>> These cases for "None" surprised me a bit; I would have expected us >>> to error out when given an algorithm we do not recognise. >> >> I think _Result_ would be more appropriate here. > > Perhaps. But the Option/Result was not what I was surprised about. > ... > Perhaps the as_mut_slice() side is justifiable (an uninitialized > instance of ObjectID is filled by getting the full self.hash and > filling it, plus filling the algo), but the same explanation would > not apply on the read-only side. Rethinking, I guess the "why doesn't it fail in the None case?" is exactly the same question as "why Option, not Result?" as you suggested. Sorry for the noise. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 06/14] hash: add a function to look up hash algo structs 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (4 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 05/14] rust: add a hash algorithm abstraction brian m. carlson @ 2025-10-27 0:43 ` brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 20:12 ` Junio C Hamano 2025-10-27 0:43 ` [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t brian m. carlson ` (11 subsequent siblings) 17 siblings, 2 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:43 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren In C, it's easy for us to look up a hash algorithm structure by its offset by simply indexing the hash_algos array. However, in Rust, we sometimes need a pointer to pass to a C function, but we have our own hash algorithm abstraction. To get one from the other, let's provide a simple function that looks up the C structure from the offset and expose it in Rust. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- hash.c | 5 +++++ hash.h | 1 + src/hash.rs | 15 +++++++++++++++ 3 files changed, 21 insertions(+) diff --git a/hash.c b/hash.c index 81b4f87027..2f4e88e501 100644 --- a/hash.c +++ b/hash.c @@ -241,6 +241,11 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop) return oid_to_hex_r(buf, algop->empty_tree); } +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) +{ + return &hash_algos[algo]; +} + uint32_t hash_algo_by_name(const char *name) { if (!name) diff --git a/hash.h b/hash.h index 99c9c2a0a8..c47ac81989 100644 --- a/hash.h +++ b/hash.h @@ -340,6 +340,7 @@ static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx ctx->algop->final_oid_fn(oid, ctx); } +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo); /* * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if * the name doesn't match a known algorithm. diff --git a/src/hash.rs b/src/hash.rs index 1b9f07489e..a5b9493bd8 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -10,6 +10,8 @@ // You should have received a copy of the GNU General Public License along // with this program; if not, see <https://www.gnu.org/licenses/>. +use std::os::raw::c_void; + pub const GIT_MAX_RAWSZ: usize = 32; /// A binary object ID. @@ -160,4 +162,17 @@ impl HashAlgorithm { HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, } } + + /// A pointer to the C `struct git_hash_algo` for interoperability with C. + pub fn hash_algo_ptr(self) -> *const c_void { + unsafe { c::hash_algo_ptr_by_offset(self as u32) } + } +} + +pub mod c { + use std::os::raw::c_void; + + extern "C" { + pub fn hash_algo_ptr_by_offset(n: u32) -> *const c_void; + } } ^ permalink raw reply related [flat|nested] 101+ messages in thread
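A one-line usage sketch of the new helper from the Rust side, using the names from the patch; note that such a test can only run once the test binary is linked against libgit.a, which a later patch in the series arranges:

    #[test]
    fn sha256_algo_ptr_is_non_null() {
        // Fetch the C `struct git_hash_algo *` for SHA-256 as an opaque
        // pointer, ready to hand to C functions that expect the struct.
        let ptr = HashAlgorithm::SHA256.hash_algo_ptr();
        assert!(!ptr.is_null());
    }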
* Re: [PATCH 06/14] hash: add a function to look up hash algo structs 2025-10-27 0:43 ` [PATCH 06/14] hash: add a function to look up hash algo structs brian m. carlson @ 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 20:12 ` Junio C Hamano 1 sibling, 0 replies; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-28 9:18 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren On Mon, Oct 27, 2025 at 12:43:56AM +0000, brian m. carlson wrote: > diff --git a/hash.c b/hash.c > index 81b4f87027..2f4e88e501 100644 > --- a/hash.c > +++ b/hash.c > @@ -241,6 +241,11 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop) > return oid_to_hex_r(buf, algop->empty_tree); > } > > +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) > +{ > + return &hash_algos[algo]; > +} I think we should have some safety mechanisms here to verify that we don't cause an out-of-bounds access. > diff --git a/src/hash.rs b/src/hash.rs > index 1b9f07489e..a5b9493bd8 100644 > --- a/src/hash.rs > +++ b/src/hash.rs > @@ -160,4 +162,17 @@ impl HashAlgorithm { > HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, > } > } > + > + /// A pointer to the C `struct git_hash_algo` for interoperability with C. > + pub fn hash_algo_ptr(self) -> *const c_void { > + unsafe { c::hash_algo_ptr_by_offset(self as u32) } > + } > +} > + > +pub mod c { > + use std::os::raw::c_void; > + > + extern "C" { > + pub fn hash_algo_ptr_by_offset(n: u32) -> *const c_void; > + } > } I guess eventually we should replace such declarations via bindgen. If so, we could also pull in the `struct git_hash_algo` declaration and have the function return that structure instead of a void pointer. Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 06/14] hash: add a function to look up hash algo structs 2025-10-27 0:43 ` [PATCH 06/14] hash: add a function to look up hash algo structs brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-28 20:12 ` Junio C Hamano 2025-11-04 1:48 ` brian m. carlson 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-28 20:12 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > In C, it's easy for us to look up a hash algorithm structure by its > offset by simply indexing the hash_algos array. However, in Rust, we > sometimes need a pointer to pass to a C function, but we have our own > hash algorithm abstraction. > > To get one from the other, let's provide a simple function that looks up > the C structure from the offset and expose it in Rust. > > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> > --- > hash.c | 5 +++++ > hash.h | 1 + > src/hash.rs | 15 +++++++++++++++ > 3 files changed, 21 insertions(+) > > diff --git a/hash.c b/hash.c > index 81b4f87027..2f4e88e501 100644 > --- a/hash.c > +++ b/hash.c > @@ -241,6 +241,11 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop) > return oid_to_hex_r(buf, algop->empty_tree); > } > > +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) > +{ > + return &hash_algos[algo]; > +} Hmph, technically "algo" may be an "offset" into the array, but I'd consider it an implementation detail. We have hash_algo instances floating somewhere in-core, and have a way to obtain a pointer to one of these instances by "algorithm number". For the user of the API, the fact that these instances are stored in contiguous pieces of memory as an array of struct is totally irrelevant. For that reason, I was somewhat repelled by the "by-offset" part of the function name. The next function ... > uint32_t hash_algo_by_name(const char *name) ... calls what it returns "hash_algo", but the "hash_algo" returned by this new function is quite different. One is just the "algorithm number", while the other is "algorithm instance". Perhaps calling both with the same name "hash algo" is the true source of confusing naming of this new function? > +use std::os::raw::c_void; > + > pub const GIT_MAX_RAWSZ: usize = 32; > > /// A binary object ID. > @@ -160,4 +162,17 @@ impl HashAlgorithm { > HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, > } > } > + > + /// A pointer to the C `struct git_hash_algo` for interoperability with C. > + pub fn hash_algo_ptr(self) -> *const c_void { > + unsafe { c::hash_algo_ptr_by_offset(self as u32) } > + } > +} > + > +pub mod c { > + use std::os::raw::c_void; > + > + extern "C" { > + pub fn hash_algo_ptr_by_offset(n: u32) -> *const c_void; > + } > } I am somewhat surprised that we do not expose "struct git_hash_algo" the same way a previous step exposed "struct object_id" in C as "struct ObjectID" in Rust, but instead pass its address as a void pointer. Hopefully the reason for doing so may become apparent as I read further into the series? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 06/14] hash: add a function to look up hash algo structs 2025-10-28 20:12 ` Junio C Hamano @ 2025-11-04 1:48 ` brian m. carlson 2025-11-04 10:24 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-11-04 1:48 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 3162 bytes --] On 2025-10-28 at 20:12:30, Junio C Hamano wrote: > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) > > +{ > > + return &hash_algos[algo]; > > +} > > Hmph, technically "algo" may be an "offset" into the array, but I'd > consider it an implementation detail. We have hash_algo instances > floating somewhere in-core, and have a way to obtain a pointer to > one of these instances by "algorithm number". For the user of the > API, the fact that these instances are stored in contiguous pieces > of memory as an array of struct is totally irrelevant. For that > reason, I was somewhat repelled by the "by-offset" part of the > function name. I fear I don't have a better name. "by_id" is the format ID. I could write "hash_algo_ptr_by_hash_algo" but that seems slightly bizarre and difficult to type. I could do "by_index", but you might have the same objection to that name. Would you like to propose a nicer alternative? > The next function ... > > > uint32_t hash_algo_by_name(const char *name) > > ... calls what it returns "hash_algo", but the "hash_algo" returned > by this new function is quite different. One is just the "algorithm > number", while the other is "algorithm instance". Perhaps calling > both with the same name "hash algo" is the true source of confusing > naming of this new function? Note that the name is "hash_algo_ptr", not "hash_algo". That is, we're explicitly returning a pointer to the structure here. I realize that's slightly hard to notice at first glance, but it was intentional. I had the same thought about using "hash_algo" as you did, and for that reason decided not to create an ambiguous name. > I am somewhat surprised that we do not expose "struct git_hash_algo" > the same way a previous step exposed "struct object_id" in C as > "struct ObjectID" in Rust, but instead pass its address as a void > pointer. Hopefully the reason for doing so may become apparent as I > read further into the series? We're going to replace this with a nicer abstraction in Rust. Since we don't have bindgen or cbindgen yet, it's going to be kind of tricky to deal with the complexities of the structure such that we get it correctly aligned and matching, and we only need to use it when working with C, so we don't bother to write out the details here. I certainly haven't measured, but I think the Rust compiler will be able to better optimize a function like `raw_len` with two explicit possibilities, especially when it's `const`[0], than the C compiler will with reading what could be an arbitrary value out of the `rawsz` member. Because it's const, the compiler absolutely will be able to evaluate the size of anything where the hash algorithm is known at compile time, and the fact that `hex_len` is defined in terms of `raw_len` provides a helpful hint for the compiler as well, in that one is always twice the other. [0] `const` for a function meaning in this case that it can be evaluated at compile time. -- brian m.
carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
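For illustration, a small sketch of what the `const` qualifier discussed above buys, using the names from patch 05: the lengths fold away at compile time and can appear anywhere a constant expression is required, such as an array length.

    // Evaluated entirely at compile time; no runtime lookup.
    const SHA256_RAW: usize = HashAlgorithm::SHA256.raw_len(); // 32
    const SHA256_HEX: usize = HashAlgorithm::SHA256.hex_len(); // 64

    // Usable in constant position, e.g. to size a hex buffer exactly.
    fn hex_buffer() -> [u8; HashAlgorithm::SHA256.hex_len()] {
        [0u8; HashAlgorithm::SHA256.hex_len()]
    }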
* Re: [PATCH 06/14] hash: add a function to look up hash algo structs 2025-11-04 1:48 ` brian m. carlson @ 2025-11-04 10:24 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-11-04 10:24 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: >> > +const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) >> > +{ >> > + return &hash_algos[algo]; >> > +} >> >> Hmph, technically "algo" may be an "offset" into the array, but I'd >> consider it an implementation detail. We have hash_algo instances >> floating somewhere in-core, and have a way to obtain a pointer to >> one of these instances by "algorithm number". For the user of the >> API, the fact that these instances are stored in contiguous pieces >> of memory as an array of struct is totally irrelevant. For that >> reason, I was somewhat repelled by the "by-offset" part of the >> function name. > > I fear I don't have a better name. "by_id" is the format ID. I could > write "hash_algo_ptr_by_hash_algo" but that seems slightly bizarre and > difficult to type. I could do "by_index", but you might have the same > objection to that name. Would you like to propose a nicer alternative? const struct git_hash_algo *hash_algo_ptr_by_algo_number(uint32_t algo_num) { return &hash_algos[algo_num]; } Then, ... >> The next function ... >> >> > uint32_t hash_algo_by_name(const char *name) >> >> ... calls what it returns "hash_algo", but the "hash_algo" returned >> by this new function is quite different. One is just the "algorithm >> number", while the other is "algorithm instance". Perhaps calling >> both with the same name "hash algo" is the true source of confusing >> naming of this new function? ... would become uint32_t hash_algo_num_by_name(const char *name) perhaps. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (5 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 06/14] hash: add a function to look up hash algo structs brian m. carlson @ 2025-10-27 0:43 ` brian m. carlson 2025-10-28 17:22 ` Ezekiel Newren 2025-10-27 0:43 ` [PATCH 08/14] write-or-die: add an fsync component for the loose object map brian m. carlson ` (10 subsequent siblings) 17 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:43 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We want to call this code from Rust and ensure that the types are the same for compatibility, which is easiest to do if the type is a fixed size. Since unsigned int is 32 bits on all the platforms we care about, define it as a uint32_t instead. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- csum-file.c | 2 +- csum-file.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/csum-file.c b/csum-file.c index 6e21e3cac8..3d3047c776 100644 --- a/csum-file.c +++ b/csum-file.c @@ -110,7 +110,7 @@ void discard_hashfile(struct hashfile *f) free_hashfile(f); } -void hashwrite(struct hashfile *f, const void *buf, unsigned int count) +void hashwrite(struct hashfile *f, const void *buf, uint32_t count) { while (count) { unsigned left = f->buffer_len - f->offset; diff --git a/csum-file.h b/csum-file.h index 07ae11024a..ecce9d27b0 100644 --- a/csum-file.h +++ b/csum-file.h @@ -63,7 +63,7 @@ void free_hashfile(struct hashfile *f); */ int finalize_hashfile(struct hashfile *, unsigned char *, enum fsync_component, unsigned int); void discard_hashfile(struct hashfile *); -void hashwrite(struct hashfile *, const void *, unsigned int); +void hashwrite(struct hashfile *, const void *, uint32_t); void hashflush(struct hashfile *f); void crc32_begin(struct hashfile *); uint32_t crc32_end(struct hashfile *); ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t 2025-10-27 0:43 ` [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t brian m. carlson @ 2025-10-28 17:22 ` Ezekiel Newren 0 siblings, 0 replies; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 17:22 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Patrick Steinhardt On Sun, Oct 26, 2025 at 6:44 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > > We want to call this code from Rust and ensure that the types are the > same for compatibility, which is easiest to do if the type is a fixed > size. Since unsigned int is 32 bits on all the platforms we care about, > define it as a uint32_t instead. I'm always in favor of converting to unambiguous types. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 08/14] write-or-die: add an fsync component for the loose object map 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (6 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 07/14] csum-file: define hashwrite's count as a uint32_t brian m. carlson @ 2025-10-27 0:43 ` brian m. carlson 2025-10-27 0:43 ` [PATCH 09/14] hash: expose hash context functions to Rust brian m. carlson ` (9 subsequent siblings) 17 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:43 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We'll soon be writing out a loose object map using the hashfile code. Add an fsync component to allow us to handle fsyncing it correctly. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- write-or-die.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/write-or-die.h b/write-or-die.h index 65a5c42a47..8d5ec23e1f 100644 --- a/write-or-die.h +++ b/write-or-die.h @@ -21,6 +21,7 @@ enum fsync_component { FSYNC_COMPONENT_COMMIT_GRAPH = 1 << 3, FSYNC_COMPONENT_INDEX = 1 << 4, FSYNC_COMPONENT_REFERENCE = 1 << 5, + FSYNC_COMPONENT_LOOSE_OBJECT_MAP = 1 << 6, }; #define FSYNC_COMPONENTS_OBJECTS (FSYNC_COMPONENT_LOOSE_OBJECT | \ @@ -44,7 +45,8 @@ enum fsync_component { FSYNC_COMPONENT_PACK_METADATA | \ FSYNC_COMPONENT_COMMIT_GRAPH | \ FSYNC_COMPONENT_INDEX | \ - FSYNC_COMPONENT_REFERENCE) + FSYNC_COMPONENT_REFERENCE | \ + FSYNC_COMPONENT_LOOSE_OBJECT_MAP) #ifndef FSYNC_COMPONENTS_PLATFORM_DEFAULT #define FSYNC_COMPONENTS_PLATFORM_DEFAULT FSYNC_COMPONENTS_DEFAULT ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH 09/14] hash: expose hash context functions to Rust 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (7 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 08/14] write-or-die: add an fsync component for the loose object map brian m. carlson @ 2025-10-27 0:43 ` brian m. carlson 2025-10-29 16:32 ` Junio C Hamano 2025-10-27 0:44 ` [PATCH 10/14] rust: add a build.rs script for tests brian m. carlson ` (8 subsequent siblings) 17 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:43 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We'd like to be able to hash our data in Rust using the same contexts as in C. However, we need our helper functions to not be inline so they can be linked into the binary appropriately. In addition, to avoid managing memory manually and since we don't know the size of the hash context structure, we want to have simple alloc and free functions we can use to make sure a context can be easily dynamically created. Expose the helper functions and create alloc, free, and init functions we can call. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- hash.c | 35 +++++++++++++++++++++++++++++++++++ hash.h | 27 +++++++-------------------- 2 files changed, 42 insertions(+), 20 deletions(-) diff --git a/hash.c b/hash.c index 2f4e88e501..4977e13de6 100644 --- a/hash.c +++ b/hash.c @@ -246,6 +246,41 @@ const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo) return &hash_algos[algo]; } +struct git_hash_ctx *git_hash_alloc(void) +{ + return malloc(sizeof(struct git_hash_ctx)); +} + +void git_hash_free(struct git_hash_ctx *ctx) +{ + free(ctx); +} + +void git_hash_init(struct git_hash_ctx *ctx, const struct git_hash_algo *algop) +{ + algop->init_fn(ctx); +} + +void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src) +{ + src->algop->clone_fn(dst, src); +} + +void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len) +{ + ctx->algop->update_fn(ctx, in, len); +} + +void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx) +{ + ctx->algop->final_fn(hash, ctx); +} + +void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx) +{ + ctx->algop->final_oid_fn(oid, ctx); +} + uint32_t hash_algo_by_name(const char *name) { if (!name) diff --git a/hash.h b/hash.h index c47ac81989..a937b8aff0 100644 --- a/hash.h +++ b/hash.h @@ -320,27 +320,14 @@ struct git_hash_algo { }; extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS]; -static inline void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src) -{ - src->algop->clone_fn(dst, src); -} - -static inline void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len) -{ - ctx->algop->update_fn(ctx, in, len); -} - -static inline void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx) -{ - ctx->algop->final_fn(hash, ctx); -} - -static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx) -{ - ctx->algop->final_oid_fn(oid, ctx); -} - +void git_hash_init(struct git_hash_ctx *ctx, const struct git_hash_algo *algop); +void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src); +void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len); +void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx); +void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx); const struct git_hash_algo *hash_algo_ptr_by_offset(uint32_t algo); 
+struct git_hash_ctx *git_hash_alloc(void); +void git_hash_free(struct git_hash_ctx *ctx); /* * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if * the name doesn't match a known algorithm. ^ permalink raw reply related [flat|nested] 101+ messages in thread
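A hypothetical sketch, not part of the patch, of the kind of Rust-side wrapper these helpers enable: the context stays opaque, and `Drop` guarantees the matching free. The extern signatures are simplified here to void pointers, and the sketch assumes patch 05's HashAlgorithm and patch 06's hash_algo_ptr() are in scope.

    use std::os::raw::c_void;

    extern "C" {
        // Thin bindings to the helpers added above; the context remains
        // opaque on the Rust side, so we never need its size or layout.
        fn git_hash_alloc() -> *mut c_void;
        fn git_hash_free(ctx: *mut c_void);
        fn git_hash_init(ctx: *mut c_void, algop: *const c_void);
        fn git_hash_update(ctx: *mut c_void, data: *const c_void, len: usize);
    }

    pub struct HashContext(*mut c_void);

    impl HashContext {
        pub fn new(algo: HashAlgorithm) -> Self {
            let ctx = unsafe { git_hash_alloc() };
            unsafe { git_hash_init(ctx, algo.hash_algo_ptr()) };
            HashContext(ctx)
        }

        pub fn update(&mut self, data: &[u8]) {
            unsafe {
                git_hash_update(self.0, data.as_ptr() as *const c_void, data.len())
            };
        }
    }

    impl Drop for HashContext {
        fn drop(&mut self) {
            // Freed exactly once, even on early return or panic.
            unsafe { git_hash_free(self.0) }
        }
    }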
* Re: [PATCH 09/14] hash: expose hash context functions to Rust 2025-10-27 0:43 ` [PATCH 09/14] hash: expose hash context functions to Rust brian m. carlson @ 2025-10-29 16:32 ` Junio C Hamano 2025-10-30 21:42 ` brian m. carlson 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 16:32 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > +struct git_hash_ctx *git_hash_alloc(void) > +{ > + return malloc(sizeof(struct git_hash_ctx)); > +} Not an objection, but this looked especially curious to me because it has been customary to use xmalloc() for a thing like this. Going forward, is it our intention that we'd explicitly handle OOM allocation failures ourselves, at least in the Rust part of the code base? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 09/14] hash: expose hash context functions to Rust 2025-10-29 16:32 ` Junio C Hamano @ 2025-10-30 21:42 ` brian m. carlson 2025-10-30 21:52 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-30 21:42 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 733 bytes --] On 2025-10-29 at 16:32:50, Junio C Hamano wrote: > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > +struct git_hash_ctx *git_hash_alloc(void) > > +{ > > + return malloc(sizeof(struct git_hash_ctx)); > > +} > > Not an objection, but this looked especially curious to me because > it has been customary to use xmalloc() for a thing like this. Going > forward, is it our intention that we'd explicitly handle OOM allocation > failures ourselves, at least in the Rust part of the code base? No, I'll change this to use `xmalloc`. Rust handles allocation itself and just panics on OOM, so we will not want to handle allocation failures ourselves. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 09/14] hash: expose hash context functions to Rust 2025-10-30 21:42 ` brian m. carlson @ 2025-10-30 21:52 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-10-30 21:52 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > On 2025-10-29 at 16:32:50, Junio C Hamano wrote: >> "brian m. carlson" <sandals@crustytoothpaste.net> writes: >> >> > +struct git_hash_ctx *git_hash_alloc(void) >> > +{ >> > + return malloc(sizeof(struct git_hash_ctx)); >> > +} >> >> Not an objection, but this looked especially curious to me because >> it has been customary to use xmalloc() for a thing like this. Going >> forward, is it our intention that we'd explicitly handle OOM allocation >> failures ourselves, at least in the Rust part of the code base? > > No, I'll change this to use `xmalloc`. Rust handles allocation itself > and just panics on OOM, so we will not want to handle allocation > failures ourselves. Thanks. And re-reading what I wrote, it does not make much sense, as we would want the integration to go in both directions. I should try hard to get out of this mentality of talking about the C part and the Rust part of the system. What is allocated on one side needs to be able to go to the other side and then come back seamlessly. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 10/14] rust: add a build.rs script for tests 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (8 preceding siblings ...) 2025-10-27 0:43 ` [PATCH 09/14] hash: expose hash context functions to Rust brian m. carlson @ 2025-10-27 0:44 ` brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-29 16:43 ` Junio C Hamano 2025-10-27 0:44 ` [PATCH 11/14] rust: add functionality to hash an object brian m. carlson ` (7 subsequent siblings) 17 siblings, 2 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:44 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren Cargo uses the build.rs script to determine how to compile and link a binary. The only binary we're generating, however, is for our tests; in a future commit, we're going to link against libgit.a for some functionality, and we'll need to make sure the test binaries are complete. Add a build.rs file for this case and specify the files we're going to be linking against. Because we cannot specify different dependencies when building our static library versus our tests, update the Makefile to specify these dependencies for our static library to avoid race conditions during build. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Makefile | 2 +- build.rs | 21 +++++++++++++++++++++ 2 files changed, 22 insertions(+), 1 deletion(-) create mode 100644 build.rs diff --git a/Makefile b/Makefile index 7e5a735ca6..7c36302717 100644 --- a/Makefile +++ b/Makefile @@ -2948,7 +2948,7 @@ scalar$X: scalar.o GIT-LDFLAGS $(GITLIBS) $(LIB_FILE): $(LIB_OBJS) $(QUIET_AR)$(RM) $@ && $(AR) $(ARFLAGS) $@ $^ -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(XDIFF_LIB) $(LIB_FILE) $(REFTABLE_LIB) $(QUIET_CARGO)cargo build $(CARGO_ARGS) .PHONY: rust diff --git a/build.rs b/build.rs new file mode 100644 index 0000000000..136d58c35a --- /dev/null +++ b/build.rs @@ -0,0 +1,21 @@ +// This program is free software; you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation: version 2 of the License, dated June 1991. +// +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License along +// with this program; if not, see <https://www.gnu.org/licenses/>. + +fn main() { + println!("cargo::rustc-link-search=."); + println!("cargo::rustc-link-search=reftable"); + println!("cargo::rustc-link-search=xdiff"); + println!("cargo::rustc-link-lib=git"); + println!("cargo::rustc-link-lib=reftable"); + println!("cargo::rustc-link-lib=z"); + println!("cargo::rustc-link-lib=xdiff"); +} ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-27 0:44 ` [PATCH 10/14] rust: add a build.rs script for tests brian m. carlson @ 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 17:42 ` Ezekiel Newren 1 sibling, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-28 9:18 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren On Mon, Oct 27, 2025 at 12:44:00AM +0000, brian m. carlson wrote: > diff --git a/Makefile b/Makefile > index 7e5a735ca6..7c36302717 100644 > --- a/Makefile > +++ b/Makefile > @@ -2948,7 +2948,7 @@ scalar$X: scalar.o GIT-LDFLAGS $(GITLIBS) > $(LIB_FILE): $(LIB_OBJS) > $(QUIET_AR)$(RM) $@ && $(AR) $(ARFLAGS) $@ $^ > > -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) > +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(XDIFF_LIB) $(LIB_FILE) $(REFTABLE_LIB) > $(QUIET_CARGO)cargo build $(CARGO_ARGS) We have recently removed the separate xdiff and reftable libraries, so it shouldn't be necessary to have these anymore. But one thing I'm curious about: don't we have a circular dependency between the Rust and C library now? I guess that's somewhat expected, as we'll want to call Rust from C and vice versa. But on the Meson side I think we need to adjust our logic so that we don't pull the Rust library into libgit.a to break this cycle. > diff --git a/build.rs b/build.rs > new file mode 100644 > index 0000000000..136d58c35a > --- /dev/null > +++ b/build.rs > @@ -0,0 +1,21 @@ > +// This program is free software; you can redistribute it and/or modify > +// it under the terms of the GNU General Public License as published by > +// the Free Software Foundation: version 2 of the License, dated June 1991. > +// > +// This program is distributed in the hope that it will be useful, > +// but WITHOUT ANY WARRANTY; without even the implied warranty of > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +// GNU General Public License for more details. > +// > +// You should have received a copy of the GNU General Public License along > +// with this program; if not, see <https://www.gnu.org/licenses/>. > + > +fn main() { > + println!("cargo::rustc-link-search=."); > + println!("cargo::rustc-link-search=reftable"); > + println!("cargo::rustc-link-search=xdiff"); > + println!("cargo::rustc-link-lib=git"); > + println!("cargo::rustc-link-lib=reftable"); > + println!("cargo::rustc-link-lib=z"); > + println!("cargo::rustc-link-lib=xdiff"); > +} How do we ensure that the correct libraries are linked here? E.g. for libz, if there are multiple such libraries, which one gets precedence? Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-28 17:42 ` Ezekiel Newren 0 siblings, 0 replies; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 17:42 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: brian m. carlson, git, Junio C Hamano On Tue, Oct 28, 2025 at 3:18 AM Patrick Steinhardt <ps@pks.im> wrote: > > On Mon, Oct 27, 2025 at 12:44:00AM +0000, brian m. carlson wrote: > > diff --git a/Makefile b/Makefile > > index 7e5a735ca6..7c36302717 100644 > > --- a/Makefile > > +++ b/Makefile > > @@ -2948,7 +2948,7 @@ scalar$X: scalar.o GIT-LDFLAGS $(GITLIBS) > > $(LIB_FILE): $(LIB_OBJS) > > $(QUIET_AR)$(RM) $@ && $(AR) $(ARFLAGS) $@ $^ > > > > -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) > > +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(XDIFF_LIB) $(LIB_FILE) $(REFTABLE_LIB) > > $(QUIET_CARGO)cargo build $(CARGO_ARGS) > > We have recently removed the separate xdiff and reftable libraries, so > it shouldn't be necessary to have these anymore. Patrick is referring to my Makefile update libgit.a patch series that has been merged into master [1]. > But one thing I'm curious about: don't we have a circular dependency > between the Rust and C library now? I guess that's somewhat expected, as > we'll want to call Rust from C and vice versa. But on the Meson side I > think we need to adjust our logic so that we don't pull the Rust library > into libgit.a to break this cycle. > > > diff --git a/build.rs b/build.rs > > new file mode 100644 > > index 0000000000..136d58c35a > > --- /dev/null > > +++ b/build.rs > > @@ -0,0 +1,21 @@ > > +// This program is free software; you can redistribute it and/or modify > > +// it under the terms of the GNU General Public License as published by > > +// the Free Software Foundation: version 2 of the License, dated June 1991. > > +// > > +// This program is distributed in the hope that it will be useful, > > +// but WITHOUT ANY WARRANTY; without even the implied warranty of > > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > > +// GNU General Public License for more details. > > +// > > +// You should have received a copy of the GNU General Public License along > > +// with this program; if not, see <https://www.gnu.org/licenses/>. > > + > > +fn main() { > > + println!("cargo::rustc-link-search=."); > > + println!("cargo::rustc-link-search=reftable"); > > + println!("cargo::rustc-link-search=xdiff"); > > + println!("cargo::rustc-link-lib=git"); > > + println!("cargo::rustc-link-lib=reftable"); > > + println!("cargo::rustc-link-lib=z"); > > + println!("cargo::rustc-link-lib=xdiff"); > > +} > > How do we ensure that the correct libraries are linked here? E.g. for > libz, if there are multiple such libraries, which one gets precedence? I solved this problem in my own Introduce Rust series [2,3]. When the Makefile or Meson invokes Cargo, it sets the environment variable `USE_LINKING=false`, and build.rs doesn't link against libgit.a or any other library. When `cargo test` is called, it will link against libgit.a, because an unset USE_LINKING is assumed to be true (roughly as sketched below).
[1] Makefile update libgit.a https://lore.kernel.org/git/pull.2065.v2.git.git.1759447647.gitgitgadget@gmail.com/ [2] Ezekiel's Introduce Rust https://lore.kernel.org/git/6032a8740c0ba72420f42c3d8d801e1bdeec12d0.1758071798.git.gitgitgadget@gmail.com/ [3] Ezekiel's Introduce Rust https://lore.kernel.org/git/6a27e07e6310b6cad0e3feae817269b9b8eaed69.1758071798.git.gitgitgadget@gmail.com/ ^ permalink raw reply [flat|nested] 101+ messages in thread
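A rough sketch of the `USE_LINKING` convention described in the message above, under the assumption that an unset variable means linking is wanted:

    // build.rs -- sketch only. `cargo test` links against libgit.a;
    // Makefile/Meson builds, which drive the final link themselves,
    // export USE_LINKING=false to opt out.
    fn main() {
        let link = std::env::var("USE_LINKING")
            .map(|v| v != "false")
            .unwrap_or(true);
        if link {
            println!("cargo::rustc-link-search=.");
            println!("cargo::rustc-link-lib=git");
            println!("cargo::rustc-link-lib=z");
        }
    }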
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-27 0:44 ` [PATCH 10/14] rust: add a build.rs script for tests brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-29 16:43 ` Junio C Hamano 2025-10-29 22:10 ` Ezekiel Newren 1 sibling, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 16:43 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > Cargo uses the build.rs script to determine how to compile and link a > binary. The only binary we're generating, however, is for our tests; > in a future commit, we're going to link against libgit.a for some > functionality, and we'll need to make sure the test binaries are > complete. OK. > -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) > +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(XDIFF_LIB) $(LIB_FILE) $(REFTABLE_LIB) > $(QUIET_CARGO)cargo build $(CARGO_ARGS) > ... > +fn main() { > + println!("cargo::rustc-link-search=."); > + println!("cargo::rustc-link-search=reftable"); > + println!("cargo::rustc-link-search=xdiff"); > + println!("cargo::rustc-link-lib=git"); > + println!("cargo::rustc-link-lib=reftable"); > + println!("cargo::rustc-link-lib=z"); > + println!("cargo::rustc-link-lib=xdiff"); > +} Hmm, I recall Ezekiel earlier arguing to roll reftable and xdiff libraries into libgit.a as it is a lot more cumbersome to have to link with multiple libraries (sorry, I may be misremembering and do not have a reference handy), but if the above is all it takes to link with these, perhaps it is not such a huge deal? I am a bit confused. XDIFF_LIB and REFTABLE_LIB are gone from Makefile on 'master' already. Perhaps we should revert earlier series from him? Thanks. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-29 16:43 ` Junio C Hamano @ 2025-10-29 22:10 ` Ezekiel Newren 2025-10-29 23:12 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Ezekiel Newren @ 2025-10-29 22:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: brian m. carlson, git, Patrick Steinhardt On Wed, Oct 29, 2025 at 10:43 AM Junio C Hamano <gitster@pobox.com> wrote: > > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > Cargo uses the build.rs script to determine how to compile and link a > > binary. The only binary we're generating, however, is for our tests, > > but in a future commit, we're going to link against libgit.a for some > > functionality and we'll need to make sure the test binaries are > > complete. > > OK. > > > -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) > > +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(XDIFF_LIB) $(LIB_FILE) $(REFTABLE_LIB) > > $(QUIET_CARGO)cargo build $(CARGO_ARGS) > > ... > > +fn main() { > > + println!("cargo::rustc-link-search=."); > > + println!("cargo::rustc-link-search=reftable"); > > + println!("cargo::rustc-link-search=xdiff"); > > + println!("cargo::rustc-link-lib=git"); > > + println!("cargo::rustc-link-lib=reftable"); > > + println!("cargo::rustc-link-lib=z"); > > + println!("cargo::rustc-link-lib=xdiff"); > > +} > > Hmm, I recall Ezekiel earlier arguing to roll reftable and xdiff > libraries into libgit.a as it is a lot more cumbersome to have to > link with multiple libraries (sorry, I may be misremembering and do > not have reference handy), but if the above is all it takes to link > with these, perhaps it is not such a huge deal? I think Brian might have written this before my series was merged in. > I am a bit confused. > > XDIFF_LIB and REFTABLE_LIB are gone from Makefile on 'master' > already. Perhaps we should revert earlier series from him? I don't think we should revert my series. Brian should delete certain lines like so: fn main() { println!("cargo::rustc-link-search=."); - println!("cargo::rustc-link-search=reftable"); - println!("cargo::rustc-link-search=xdiff"); println!("cargo::rustc-link-lib=git"); - println!("cargo::rustc-link-lib=reftable"); println!("cargo::rustc-link-lib=z"); - println!("cargo::rustc-link-lib=xdiff"); } Also the makefile needs to add the flag -fPIC or -fPIE when compiling with Rust. ^ permalink raw reply [flat|nested] 101+ messages in thread
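For clarity, the build.rs that would remain after Ezekiel's suggested deletions would look roughly like this (a sketch derived from the diff above; zlib stays explicitly linked):

fn main() {
    println!("cargo::rustc-link-search=.");
    println!("cargo::rustc-link-lib=git");
    println!("cargo::rustc-link-lib=z");
}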
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-29 22:10 ` Ezekiel Newren @ 2025-10-29 23:12 ` Junio C Hamano 2025-10-30 6:26 ` Patrick Steinhardt 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 23:12 UTC (permalink / raw) To: Ezekiel Newren; +Cc: brian m. carlson, git, Patrick Steinhardt Ezekiel Newren <ezekielnewren@gmail.com> writes: >> Hmm, I recall Ezekiel earlier arguing to roll reftable and xdiff >> libraries into libgit.a as it is a lot more cumbersome to have to >> link with multiple libraries (sorry, I may be misremembering and do >> not have reference handy), but if the above is all it takes to link >> with these, perhaps it is not such a huge deal? > > I think Brian might have written this before my series was merged in. > ... >> I am a bit confused. >> >> XDIFF_LIB and REFTABLE_LIB are gone from Makefile on 'master' >> already. Perhaps we should revert earlier series from him? > ... > I don't think we should revert my series. The order of events does not really matter, does it? If we can happily link with more than one library [*], it would give us a much more pleasant developer experience than having to roll everything into a single library archive, no? Or are you saying that the way this series links these multiple libraries somehow does not work? You somehow managed to confuse me even more ... X-<. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-29 23:12 ` Junio C Hamano @ 2025-10-30 6:26 ` Patrick Steinhardt 2025-10-30 13:54 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-30 6:26 UTC (permalink / raw) To: Junio C Hamano; +Cc: Ezekiel Newren, brian m. carlson, git On Wed, Oct 29, 2025 at 04:12:05PM -0700, Junio C Hamano wrote: > Ezekiel Newren <ezekielnewren@gmail.com> writes: > > >> Hmm, I recall Ezekiel earlier arguing to roll reftable and xdiff > >> libraries into libgit.a as it is a lot more cumbersome to have to > >> link with multiple libraries (sorry, I may be misremembering and do > >> not have reference handy), but if the above is all it takes to link > >> with these, perhaps it is not such a huge deal? > > > > I think Brian might have written this before my series was merged in. > > ... > >> I am a bit confused. > >> > >> XDIFF_LIB and REFTABLE_LIB are gone from Makefile on 'master' > >> already. Perhaps we should revert earlier series from him? > > ... > > I don't think we should revert my series. > > The order of events does not really matter, does it? > > If we can happily link with more than one library [*], it would > give us a much more pleasant developer experience than having to > roll everything into a single library archive, no? Or are you > saying that the way this series links these multiple libraries > somehow does not work? > > You somehow managed to confuse me even more ... X-<. Simplification was only one of the reasons we had. The other reason was to unify how Meson and Makefiles build libgit.a, where the former wasn't ever building separate xdiff and reftable libraries. The question I have here is what the benefit would be to have separate libraries. I don't really see the "more pleasant developer experience", and I'm not really aware of any other benefits. So personally, I'm all for the build system simplification that Ezekiel introduced. Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-30 6:26 ` Patrick Steinhardt @ 2025-10-30 13:54 ` Junio C Hamano 2025-10-31 22:43 ` Ezekiel Newren 0 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-30 13:54 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: Ezekiel Newren, brian m. carlson, git Patrick Steinhardt <ps@pks.im> writes: > The question I have here is what the benefit would be to have separate > libraries. Mostly flexibility. If we do not value it, then that is OK, though. And personally I would have to say that "meson rolled everything into a single library archive" is a bad excuse---whatever came later doing things differently from the incumbent has to have a good reason to do things differently, or it is a regression. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-30 13:54 ` Junio C Hamano @ 2025-10-31 22:43 ` Ezekiel Newren 2025-11-01 11:18 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: Ezekiel Newren @ 2025-10-31 22:43 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Steinhardt, brian m. carlson, git On Thu, Oct 30, 2025 at 7:54 AM Junio C Hamano <gitster@pobox.com> wrote: > > Patrick Steinhardt <ps@pks.im> writes: > > > The question I have here is what the benefit would be to have separate > > libraries. > > Mostly flexibility. If we do not value it, then that is OK, though. > > And personally I would have to say that "meson rolled everything > into a single library archive" is a bad excuse---whatever came later > doing things differently from the incumbent has to have a good reason > to do things differently, or it is a regression. I don't understand why "Simplify Cargo's job of linking with the build systems of Makefile and Meson" isn't a good enough reason by itself. Nor do I understand why having libxdiff.a and libreftable.a produces a better developer experience. My developer experience has been strictly worse because of this separation. If we keep Makefile the way that it was and change Meson to also produce separate static libraries then we'll need to keep 3 build systems in sync with each other. If we roll everything into libgit.a then Cargo only ever needs to know about that static library, Meson doesn't change, and there's no question about where new object files should be added in Makefile. If we do add a 3rd conceptual standalone library then we'd only need to add the source files to Makefile and Meson, but if we insist on separate static libraries then we'll have to add the source files (as usual) and make sure that Makefile, Meson, and Cargo are all in agreement about the static libraries being produced. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 10/14] rust: add a build.rs script for tests 2025-10-31 22:43 ` Ezekiel Newren @ 2025-11-01 11:18 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-11-01 11:18 UTC (permalink / raw) To: Ezekiel Newren; +Cc: Patrick Steinhardt, brian m. carlson, git Ezekiel Newren <ezekielnewren@gmail.com> writes: >> Mostly flexibility. If we do not value it, then that is OK, though. >> >> And personally I would have to say that "meson rolled everything >> into a single library archive" is a bad excuse---whatever came later >> doing things differently from the incumbent has to have a good reason >> to do things differently, or it is a regression. > > I don't understand why "Simplify Cargo's job of linking with the build > systems of Makefile and Meson" isn't a good enough reason by itself. Was that the way it was sold, though? The motivation is to simplify Rust's job of linking against the C code by requiring it to only link against a single static library (libgit.a). was how the original cover letter sold the change. In addition, in a later thread, I saw this: Like the previous two commits; This one continues the effort to get the Rust compiler to link against libgit.a. Meson already includes the reftable in its libgit.a, but Makefile does not. It led me into (incorrectly) thinking that the Rust toolchain you are using for your series becomes very cumbersome, if not impossible, to use, if we try to have it use more than one library. My job as the project lead would have been to decide if maintaining the separation of three independent libraries was worth the hassle. In other words, I read it as "We have to make do with a single library, due to limitations of Rust build infrastructure, and that is why we are merging logically three separate libraries into one in the build structure in the Makefile. Meson based build happens to already roll everything into one library, so we do not have to do anything extra to implement this workaround for Rust. Only Makefile side needs this change." If I knew that dealing with just one library was not a requirement placed by Rust (and apparently, what brian did in the series under discussion shows that it is not), I would have instead suggested to fix the Meson based build procedure, as I do agree with the idea of "simplifying" to avoid having to deal with 1 with Meson while 3 with Makefile. But I would have suggested to link the same set of three libraries on both sides. The fact I was (mis)led into thinking that the only way to do so is to roll objects from three logically independent libraries into one (due to a limitation in building the Rust part of the code), when the other way, namely, to keep them separate also in Meson based builds, was also perfectly adequate because there is no such limitation placed by Rust, is mostly what makes me react unnecessarily strongly. Yes, I am upset. When there is no strong reason to be different for a newly introduced thing (that is, Meson relative to Makefile), it should avoid being different to avoid breaking expectations (e.g., we'd have this and that .a files left in the build directory to link with objects to produce "git"). So "I do not understand why keeping three is good" is not an argument. The Meson based build series needed to justify why rolling everything into one library was a good idea, but it seems nobody noticed the distinction back then when it was introduced, and you do not have to be retroactively defending that mistake now.
The same goes for position independent code generation (I do not know if it hurts performance very much these days, but it used to introduce a measurable hit, so the benefit needs to outweigh the cost). In any case, it has been a sufficiently long time since we lost the other two libraries in our build, so changing it back to use three separate libraries would be yet another breaking move that I do not want to see---unfortunately it is way too late for that. So brian's patch in this series may need to be rebased to a newer base to expect a single library, I think. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH 11/14] rust: add functionality to hash an object 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (9 preceding siblings ...) 2025-10-27 0:44 ` [PATCH 10/14] rust: add a build.rs script for tests brian m. carlson @ 2025-10-27 0:44 ` brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-28 18:05 ` Ezekiel Newren 2025-10-27 0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson ` (6 subsequent siblings) 17 siblings, 2 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:44 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren In a future commit, we'll want to hash some data when dealing with a loose object map. Let's make this easy by creating a structure to hash objects and calling into the C functions as necessary to perform the hashing. For now, we only implement safe hashing, but in the future we could add unsafe hashing if we want. Implement Clone and Drop to appropriately manage our memory. Additionally implement Write to make it easy to use with other formats that implement this trait. While we're at it, add some tests for the various cases in this file. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- src/hash.rs | 157 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 157 insertions(+) diff --git a/src/hash.rs b/src/hash.rs index a5b9493bd8..8798a50aef 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -10,6 +10,7 @@ // You should have received a copy of the GNU General Public License along // with this program; if not, see <https://www.gnu.org/licenses/>. +use std::io::{self, Write}; use std::os::raw::c_void; pub const GIT_MAX_RAWSZ: usize = 32; @@ -39,6 +40,81 @@ impl ObjectID { } } +pub struct Hasher { + algo: HashAlgorithm, + safe: bool, + ctx: *mut c_void, +} + +impl Hasher { + /// Create a new safe hasher. + pub fn new(algo: HashAlgorithm) -> Hasher { + let ctx = unsafe { c::git_hash_alloc() }; + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; + Hasher { + algo, + safe: true, + ctx, + } + } + + /// Return whether this is a safe hasher. + pub fn is_safe(&self) -> bool { + self.safe + } + + /// Update the hasher with the specified data. + pub fn update(&mut self, data: &[u8]) { + unsafe { c::git_hash_update(self.ctx, data.as_ptr() as *const c_void, data.len()) }; + } + + /// Return an object ID, consuming the hasher. + pub fn into_oid(self) -> ObjectID { + let mut oid = ObjectID { + hash: [0u8; 32], + algo: self.algo as u32, + }; + unsafe { c::git_hash_final_oid(&mut oid as *mut ObjectID as *mut c_void, self.ctx) }; + oid + } + + /// Return a hash as a `Vec`, consuming the hasher. 
+ pub fn into_vec(self) -> Vec<u8> { + let mut v = vec![0u8; self.algo.raw_len()]; + unsafe { c::git_hash_final(v.as_mut_ptr(), self.ctx) }; + v + } +} + +impl Write for Hasher { + fn write(&mut self, data: &[u8]) -> io::Result<usize> { + self.update(data); + Ok(data.len()) + } + + fn flush(&mut self) -> io::Result<()> { + Ok(()) + } +} + +impl Clone for Hasher { + fn clone(&self) -> Hasher { + let ctx = unsafe { c::git_hash_alloc() }; + unsafe { c::git_hash_clone(ctx, self.ctx) }; + Hasher { + algo: self.algo, + safe: self.safe, + ctx, + } + } +} + +impl Drop for Hasher { + fn drop(&mut self) { + unsafe { c::git_hash_free(self.ctx) }; + } +} + /// A hash algorithm, #[repr(C)] #[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] @@ -167,6 +243,11 @@ impl HashAlgorithm { pub fn hash_algo_ptr(self) -> *const c_void { unsafe { c::hash_algo_ptr_by_offset(self as u32) } } + + /// Create a hasher for this algorithm. + pub fn hasher(self) -> Hasher { + Hasher::new(self) + } } pub mod c { @@ -174,5 +255,81 @@ pub mod c { extern "C" { pub fn hash_algo_ptr_by_offset(n: u32) -> *const c_void; + pub fn unsafe_hash_algo(algop: *const c_void) -> *const c_void; + pub fn git_hash_alloc() -> *mut c_void; + pub fn git_hash_free(ctx: *mut c_void); + pub fn git_hash_init(dst: *mut c_void, algop: *const c_void); + pub fn git_hash_clone(dst: *mut c_void, src: *const c_void); + pub fn git_hash_update(ctx: *mut c_void, inp: *const c_void, len: usize); + pub fn git_hash_final(hash: *mut u8, ctx: *mut c_void); + pub fn git_hash_final_oid(hash: *mut c_void, ctx: *mut c_void); + } +} + +#[cfg(test)] +mod tests { + use super::{HashAlgorithm, ObjectID}; + use std::io::Write; + + fn all_algos() -> &'static [HashAlgorithm] { + &[HashAlgorithm::SHA1, HashAlgorithm::SHA256] + } + + #[test] + fn format_id_round_trips() { + for algo in all_algos() { + assert_eq!( + *algo, + HashAlgorithm::from_format_id(algo.format_id()).unwrap() + ); + } + } + + #[test] + fn offset_round_trips() { + for algo in all_algos() { + assert_eq!(*algo, HashAlgorithm::from_u32(*algo as u32).unwrap()); + } + } + + #[test] + fn slices_have_correct_length() { + for algo in all_algos() { + for oid in [algo.null_oid(), algo.empty_blob(), algo.empty_tree()] { + assert_eq!(oid.as_slice().len(), algo.raw_len()); + } + } + } + + #[test] + fn hasher_works_correctly() { + for algo in all_algos() { + let tests: &[(&[u8], &ObjectID)] = &[ + (b"blob 0\0", algo.empty_blob()), + (b"tree 0\0", algo.empty_tree()), + ]; + for (data, oid) in tests { + let mut h = algo.hasher(); + assert_eq!(h.is_safe(), true); + // Test that this works incrementally. + h.update(&data[0..2]); + h.update(&data[2..]); + + let h2 = h.clone(); + + let actual_oid = h.into_oid(); + assert_eq!(**oid, actual_oid); + + let v = h2.into_vec(); + assert_eq!((*oid).as_slice(), &v); + + let mut h = algo.hasher(); + h.write_all(&data[0..2]).unwrap(); + h.write_all(&data[2..]).unwrap(); + + let actual_oid = h.into_oid(); + assert_eq!(**oid, actual_oid); + } + } } } ^ permalink raw reply related [flat|nested] 101+ messages in thread
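Because the patch implements std::io::Write for Hasher, streaming data into it composes with the standard library. A minimal usage sketch, assuming the crate layout of this series; `hash_stream` is a hypothetical helper, not part of the patch:

use std::io::{self, Read};

use crate::hash::{HashAlgorithm, ObjectID};

/// Stream `input` into a hasher and return the resulting object ID.
fn hash_stream<R: Read>(mut input: R, algo: HashAlgorithm) -> io::Result<ObjectID> {
    let mut hasher = algo.hasher();
    // io::copy drives the Write implementation added by this patch.
    io::copy(&mut input, &mut hasher)?;
    Ok(hasher.into_oid())
}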
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-27 0:44 ` [PATCH 11/14] rust: add functionality to hash an object brian m. carlson @ 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-29 0:53 ` brian m. carlson 2025-10-28 18:05 ` Ezekiel Newren 1 sibling, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-28 9:18 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren On Mon, Oct 27, 2025 at 12:44:01AM +0000, brian m. carlson wrote: > In a future commit, we'll want to hash some data when dealing with a > loose object map. Let's make this easy by creating a structure to hash > objects and calling into the C functions as necessary to perform the > hashing. For now, we only implement safe hashing, but in the future we > could add unsafe hashing if we want. Implement Clone and Drop to > appropriately manage our memory. Additionally implement Write to make > it easy to use with other formats that implement this trait. What exactly do you mean with "safe" and "unsafe" hashing? Also, can't we drop this distinction for now until we have a need for it? > diff --git a/src/hash.rs b/src/hash.rs > index a5b9493bd8..8798a50aef 100644 > --- a/src/hash.rs > +++ b/src/hash.rs > @@ -39,6 +40,81 @@ impl ObjectID { > } > } > > +pub struct Hasher { > + algo: HashAlgorithm, > + safe: bool, > + ctx: *mut c_void, > +} Nit: missing documentation. > +impl Hasher { > + /// Create a new safe hasher. > + pub fn new(algo: HashAlgorithm) -> Hasher { > + let ctx = unsafe { c::git_hash_alloc() }; > + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; I already noticed this in the patch that introduced this, but wouldn't it make sense to expose `git_hash_new()` instead of the combination of `alloc() + init()`? > + Hasher { > + algo, > + safe: true, > + ctx, > + } > + } > + > + /// Return whether this is a safe hasher. > + pub fn is_safe(&self) -> bool { > + self.safe > + } > + > + /// Update the hasher with the specified data. > + pub fn update(&mut self, data: &[u8]) { > + unsafe { c::git_hash_update(self.ctx, data.as_ptr() as *const c_void, data.len()) }; > + } > + > + /// Return an object ID, consuming the hasher. > + pub fn into_oid(self) -> ObjectID { > + let mut oid = ObjectID { > + hash: [0u8; 32], > + algo: self.algo as u32, > + }; > + unsafe { c::git_hash_final_oid(&mut oid as *mut ObjectID as *mut c_void, self.ctx) }; > + oid > + } > + > + /// Return a hash as a `Vec`, consuming the hasher. > + pub fn into_vec(self) -> Vec<u8> { > + let mut v = vec![0u8; self.algo.raw_len()]; > + unsafe { c::git_hash_final(v.as_mut_ptr(), self.ctx) }; > + v > + } > +} > + > +impl Write for Hasher { > + fn write(&mut self, data: &[u8]) -> io::Result<usize> { > + self.update(data); > + Ok(data.len()) > + } > + > + fn flush(&mut self) -> io::Result<()> { > + Ok(()) > + } > +} Yup, sensible to implement this interface. > +impl Clone for Hasher { > + fn clone(&self) -> Hasher { > + let ctx = unsafe { c::git_hash_alloc() }; > + unsafe { c::git_hash_clone(ctx, self.ctx) }; > + Hasher { > + algo: self.algo, > + safe: self.safe, > + ctx, > + } > + } > +} Makes sense. > +impl Drop for Hasher { > + fn drop(&mut self) { > + unsafe { c::git_hash_free(self.ctx) }; > + } > +} Likewise. Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-29 0:53 ` brian m. carlson 2025-10-29 9:07 ` Patrick Steinhardt 0 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-29 0:53 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 2153 bytes --] On 2025-10-28 at 09:18:26, Patrick Steinhardt wrote: > On Mon, Oct 27, 2025 at 12:44:01AM +0000, brian m. carlson wrote: > > In a future commit, we'll want to hash some data when dealing with a > > loose object map. Let's make this easy by creating a structure to hash > > objects and calling into the C functions as necessary to perform the > > hashing. For now, we only implement safe hashing, but in the future we > > could add unsafe hashing if we want. Implement Clone and Drop to > > appropriately manage our memory. Additionally implement Write to make > > it easy to use with other formats that implement this trait. > > What exactly do you mean with "safe" and "unsafe" hashing? Also, can't > we drop this distinction for now until we have a need for it? It's from the series that Taylor introduced. For SHA-1, safe hashing (the default) uses SHA-1-DC, but unsafe hashing, which does not operate on untrusted data (say, when we're writing a packfile we've created), may use a faster algorithm. See `git_hash_sha1_init_unsafe`. I can omit the `safe` attribute until we need it, sure. > > diff --git a/src/hash.rs b/src/hash.rs > > index a5b9493bd8..8798a50aef 100644 > > --- a/src/hash.rs > > +++ b/src/hash.rs > > @@ -39,6 +40,81 @@ impl ObjectID { > > } > > } > > > > +pub struct Hasher { > > + algo: HashAlgorithm, > > + safe: bool, > > + ctx: *mut c_void, > > +} > > Nit: missing documentation. Will fix in v2. > > +impl Hasher { > > + /// Create a new safe hasher. > > + pub fn new(algo: HashAlgorithm) -> Hasher { > > + let ctx = unsafe { c::git_hash_alloc() }; > > + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; > > I already noticed this in the patch that introduced this, but wouldn't > it make sense to expose `git_hash_new()` instead of the combination of > `alloc() + init()`? The benefit to this approach is that it allows us to reset a state in the future if we want. If we don't think that's necessary, I can certainly switch to `git_hash_new` if we prefer. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
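If the unsafe variant brian describes were added later, one possible shape is sketched below against the extern declarations already in this patch; `new_unsafe` is hypothetical, and `unsafe_hash_algo` is the binding that maps an algorithm to its faster, non-collision-detecting counterpart:

impl Hasher {
    /// Create a hasher that skips collision detection; callers must feed
    /// it only trusted data.
    pub fn new_unsafe(algo: HashAlgorithm) -> Hasher {
        let ctx = unsafe { c::git_hash_alloc() };
        // Swap in the faster variant without SHA-1-DC collision detection.
        let algop = unsafe { c::unsafe_hash_algo(algo.hash_algo_ptr()) };
        unsafe { c::git_hash_init(ctx, algop) };
        Hasher { algo, safe: false, ctx }
    }
}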
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-29 0:53 ` brian m. carlson @ 2025-10-29 9:07 ` Patrick Steinhardt 0 siblings, 0 replies; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-29 9:07 UTC (permalink / raw) To: brian m. carlson, git, Junio C Hamano, Ezekiel Newren On Wed, Oct 29, 2025 at 12:53:20AM +0000, brian m. carlson wrote: > On 2025-10-28 at 09:18:26, Patrick Steinhardt wrote: > > On Mon, Oct 27, 2025 at 12:44:01AM +0000, brian m. carlson wrote: > > > In a future commit, we'll want to hash some data when dealing with a > > > loose object map. Let's make this easy by creating a structure to hash > > > objects and calling into the C functions as necessary to perform the > > > hashing. For now, we only implement safe hashing, but in the future we > > > could add unsafe hashing if we want. Implement Clone and Drop to > > > appropriately manage our memory. Additionally implement Write to make > > > it easy to use with other formats that implement this trait. > > > > What exactly do you mean with "safe" and "unsafe" hashing? Also, can't > > we drop this distinction for now until we have a need for it? > > It's from the series that Taylor introduced. For SHA-1, safe hashing > (the default) uses SHA-1-DC, but unsafe hashing, which does not operate > on untrusted data (say, when we're writing a packfile we've created), > may use a faster algorithm. See `git_hash_sha1_init_unsafe`. > > I can omit the `safe` attribute until we need it, sure. Ah, I completely forgot about that distinction! Makes sense. > > > +impl Hasher { > > > + /// Create a new safe hasher. > > > + pub fn new(algo: HashAlgorithm) -> Hasher { > > > + let ctx = unsafe { c::git_hash_alloc() }; > > > + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; > > > > I already noticed this in the patch that introduced this, but wouldn't > > it make sense to expose `git_hash_new()` instead of the combination of > > `alloc() + init()`? > > The benefit to this approach is that it allows us to reset a state in > the future if we want. If we don't think that's necessary, I can > certainly switch to `git_hash_new` if we prefer. Hm, fair. I don't mind it much either way. Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
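The reset use case brian alludes to could look roughly like this; a sketch only, since the series exposes alloc() and init() separately but defines no such method:

impl Hasher {
    /// Re-initialize this context in place, discarding anything hashed so far.
    pub fn reset(&mut self) {
        // With separate alloc() and init(), the existing C-side allocation
        // is reused rather than freed and reallocated.
        unsafe { c::git_hash_init(self.ctx, self.algo.hash_algo_ptr()) };
    }
}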
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-27 0:44 ` [PATCH 11/14] rust: add functionality to hash an object brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-28 18:05 ` Ezekiel Newren 2025-10-29 1:05 ` brian m. carlson 1 sibling, 1 reply; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 18:05 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Patrick Steinhardt On Sun, Oct 26, 2025 at 6:44 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > > In a future commit, we'll want to hash some data when dealing with a > loose object map. Let's make this easy by creating a structure to hash > objects and calling into the C functions as necessary to perform the > hashing. For now, we only implement safe hashing, but in the future we > could add unsafe hashing if we want. Implement Clone and Drop to > appropriately manage our memory. Additionally implement Write to make > it easy to use with other formats that implement this trait. > > While we're at it, add some tests for the various cases in this file. > > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> > --- > src/hash.rs | 157 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 157 insertions(+) > > diff --git a/src/hash.rs b/src/hash.rs > index a5b9493bd8..8798a50aef 100644 > --- a/src/hash.rs > +++ b/src/hash.rs > @@ -10,6 +10,7 @@ > // You should have received a copy of the GNU General Public License along > // with this program; if not, see <https://www.gnu.org/licenses/>. > > +use std::io::{self, Write}; > use std::os::raw::c_void; > > pub const GIT_MAX_RAWSZ: usize = 32; > @@ -39,6 +40,81 @@ impl ObjectID { > } > } > > +pub struct Hasher { > + algo: HashAlgorithm, > + safe: bool, > + ctx: *mut c_void, > +} The name _Hasher_ is already used by std::hash::Hasher. It would be preferable to pick a different name to avoid confusion. Perhaps CryptoHasher, SecureHasher? > +impl Hasher { > + /// Create a new safe hasher. > + pub fn new(algo: HashAlgorithm) -> Hasher { > + let ctx = unsafe { c::git_hash_alloc() }; > + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; > + Hasher { > + algo, > + safe: true, > + ctx, > + } > + } - pub fn new(algo: HashAlgorithm) -> Hasher { + pub fn new(algo: HashAlgorithm) -> Self { let ctx = unsafe { c::git_hash_alloc() }; unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; - Hasher { + Self { algo, safe: true, ctx, } > + /// Return whether this is a safe hasher. > + pub fn is_safe(&self) -> bool { > + self.safe > + } I don't understand the point in being able to query whether a given hasher is safe or not. How does that change how this hasher code is used? If the functions are safe then you wouldn't wrap it in an unsafe block. If the functions are declared with unsafe then you'd always need to wrap it in an unsafe block whether it's actually safe or not. Using unsafe in Rust isn't like error handling where you do something different on failure. If something fails in unsafe it's usually unrecoverable, e.g. a segfault due to invalid memory access. My understanding is that unsafe in Rust means "The compiler can't verify that this code is actually safe to run, so I've made sure that it is safe myself and I'll let the compiler know what code to ignore during compilation." > + /// Update the hasher with the specified data.
> + pub fn update(&mut self, data: &[u8]) { > + unsafe { c::git_hash_update(self.ctx, data.as_ptr() as *const c_void, data.len()) }; > + } > + > + /// Return an object ID, consuming the hasher. > + pub fn into_oid(self) -> ObjectID { > + let mut oid = ObjectID { > + hash: [0u8; 32], > + algo: self.algo as u32, > + }; > + unsafe { c::git_hash_final_oid(&mut oid as *mut ObjectID as *mut c_void, self.ctx) }; > + oid > + } > + > + /// Return a hash as a `Vec`, consuming the hasher. > + pub fn into_vec(self) -> Vec<u8> { > + let mut v = vec![0u8; self.algo.raw_len()]; > + unsafe { c::git_hash_final(v.as_mut_ptr(), self.ctx) }; > + v > + } > +} > + > +impl Write for Hasher { > + fn write(&mut self, data: &[u8]) -> io::Result<usize> { > + self.update(data); > + Ok(data.len()) > + } > + > + fn flush(&mut self) -> io::Result<()> { > + Ok(()) > + } > +} > + > +impl Clone for Hasher { > + fn clone(&self) -> Hasher { > + let ctx = unsafe { c::git_hash_alloc() }; > + unsafe { c::git_hash_clone(ctx, self.ctx) }; > + Hasher { > + algo: self.algo, > + safe: self.safe, > + ctx, > + } > + } > +} > + > +impl Drop for Hasher { > + fn drop(&mut self) { > + unsafe { c::git_hash_free(self.ctx) }; > + } > +} Make sense. > /// A hash algorithm, > #[repr(C)] > #[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] > @@ -167,6 +243,11 @@ impl HashAlgorithm { > pub fn hash_algo_ptr(self) -> *const c_void { > unsafe { c::hash_algo_ptr_by_offset(self as u32) } > } > + > + /// Create a hasher for this algorithm. > + pub fn hasher(self) -> Hasher { > + Hasher::new(self) > + } > } > > pub mod c { > @@ -174,5 +255,81 @@ pub mod c { > > extern "C" { > pub fn hash_algo_ptr_by_offset(n: u32) -> *const c_void; > + pub fn unsafe_hash_algo(algop: *const c_void) -> *const c_void; > + pub fn git_hash_alloc() -> *mut c_void; > + pub fn git_hash_free(ctx: *mut c_void); > + pub fn git_hash_init(dst: *mut c_void, algop: *const c_void); > + pub fn git_hash_clone(dst: *mut c_void, src: *const c_void); > + pub fn git_hash_update(ctx: *mut c_void, inp: *const c_void, len: usize); > + pub fn git_hash_final(hash: *mut u8, ctx: *mut c_void); > + pub fn git_hash_final_oid(hash: *mut c_void, ctx: *mut c_void); > + } > +} > + > +#[cfg(test)] > +mod tests { > + use super::{HashAlgorithm, ObjectID}; > + use std::io::Write; > + > + fn all_algos() -> &'static [HashAlgorithm] { > + &[HashAlgorithm::SHA1, HashAlgorithm::SHA256] > + } > + > + #[test] > + fn format_id_round_trips() { > + for algo in all_algos() { > + assert_eq!( > + *algo, > + HashAlgorithm::from_format_id(algo.format_id()).unwrap() > + ); > + } > + } > + > + #[test] > + fn offset_round_trips() { > + for algo in all_algos() { > + assert_eq!(*algo, HashAlgorithm::from_u32(*algo as u32).unwrap()); > + } > + } > + > + #[test] > + fn slices_have_correct_length() { > + for algo in all_algos() { > + for oid in [algo.null_oid(), algo.empty_blob(), algo.empty_tree()] { > + assert_eq!(oid.as_slice().len(), algo.raw_len()); > + } > + } > + } > + > + #[test] > + fn hasher_works_correctly() { > + for algo in all_algos() { > + let tests: &[(&[u8], &ObjectID)] = &[ > + (b"blob 0\0", algo.empty_blob()), > + (b"tree 0\0", algo.empty_tree()), > + ]; > + for (data, oid) in tests { > + let mut h = algo.hasher(); > + assert_eq!(h.is_safe(), true); > + // Test that this works incrementally. 
> + h.update(&data[0..2]); > + h.update(&data[2..]); > + > + let h2 = h.clone(); > + > + let actual_oid = h.into_oid(); > + assert_eq!(**oid, actual_oid); > + > + let v = h2.into_vec(); > + assert_eq!((*oid).as_slice(), &v); > + > + let mut h = algo.hasher(); > + h.write_all(&data[0..2]).unwrap(); > + h.write_all(&data[2..]).unwrap(); > + > + let actual_oid = h.into_oid(); > + assert_eq!(**oid, actual_oid); > + } > + } > } > } Looks good. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-28 18:05 ` Ezekiel Newren @ 2025-10-29 1:05 ` brian m. carlson 2025-10-29 16:02 ` Ben Knoble 0 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-10-29 1:05 UTC (permalink / raw) To: Ezekiel Newren; +Cc: git, Junio C Hamano, Patrick Steinhardt [-- Attachment #1: Type: text/plain, Size: 2422 bytes --] On 2025-10-28 at 18:05:59, Ezekiel Newren wrote: > The name _Hasher_ is already used by std::hash::Hasher. It would be > preferable to pick a different name to avoid confusion. Perhaps > CryptoHasher, SecureHasher? Sure, I can pick a different name if you like. There are also myriad `Result` values in Rust: `std::result::Result`, `std::fmt::Result`, `std::io::Result`, etc., so I don't see a huge problem with it, but as I said, I can change it if folks prefer. > I don't understand the point in being able to query whether a given > hasher is safe or not. How does that change how this hasher code is > used? If the functions are safe then you wouldn't wrap it in an unsafe > block. If the functions are declared with unsafe then you'd always > need to wrap it in an unsafe block whether it's actually safe or not. > Using unsafe in Rust isn't like error handling where you do something > different on failure. If something fails in unsafe it's usually > unrecoverable, e.g. a segfault due to invalid memory access. My > understanding is that unsafe in Rust means "The compiler can't verify that > this code is actually safe to run, so I've made sure that it is safe > myself and I'll let the compiler know what code to ignore during > compilation." This is not like `unsafe` in Rust. We have some SHA-1 functions that are safe (the default ones) that use SHA-1-DC to detect collisions. People may also compile their Git version with a faster version of SHA-1 that doesn't detect collisions and that may use hardware acceleration in cases where we're not dealing with untrusted data. Taylor benchmarked it and got some pretty nice performance improvements. My preference personally was to simply say, "SHA-1 is slow since it's insecure; use SHA-256 if you want hardware acceleration and good performance," but my advice was not heeded. So this allows us to do something like `assert!(hash.is_safe())` in certain code where we know we have untrusted data to make sure we haven't been passed a Hasher that has been incorrectly initialized. We have some code paths which can accept either (and, depending on which mode they're operating in, do or don't need a safe hasher), so separate types are less convenient. We could do that, however, but it would make things more complicated and we'd need a trait that covers both. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
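The guard brian describes might look like this at the top of a routine that must only see collision-detecting hashing (`index_untrusted_data` is a hypothetical caller, not code from the series):

fn index_untrusted_data(hasher: &mut Hasher, data: &[u8]) {
    // Untrusted input must go through the collision-detecting code path.
    assert!(hasher.is_safe());
    hasher.update(data);
}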
* Re: [PATCH 11/14] rust: add functionality to hash an object 2025-10-29 1:05 ` brian m. carlson @ 2025-10-29 16:02 ` Ben Knoble 0 siblings, 0 replies; 101+ messages in thread From: Ben Knoble @ 2025-10-29 16:02 UTC (permalink / raw) To: brian m. carlson; +Cc: Ezekiel Newren, git, Junio C Hamano, Patrick Steinhardt > Le 28 oct. 2025 à 21:06, brian m. carlson <sandals@crustytoothpaste.net> a écrit : > > On 2025-10-28 at 18:05:59, Ezekiel Newren wrote: >> The name _Hasher_ is already used by std::hash::Hasher. It would be >> preferable to pick a different name to avoid confusion. Perhaps >> CryptoHasher, SecureHasher? > > Sure, I can pick a different name if you like. There are also myriad > `Result` values in Rust: `std::result::Result`, `std::fmt::Result`, > `std::io::Result`, etc., so I don't see a huge problem with it, but as I > said, I can change it if folks prefer. > >> I don't understand the point in being able to query whether a given >> hasher is safe or not. How does that change how this hasher code is >> used? If the functions are safe then you wouldn't wrap it in an unsafe >> block. If the functions are declared with unsafe then you'd always >> need to wrap it in an unsafe block whether it's actually safe or not. >> Using unsafe in Rust isn't like error handling where you do something >> different on failure. If something fails in unsafe it's usually >> unrecoverable, e.g. a segfault due to invalid memory access. My >> understanding is that unsafe in Rust means "The compiler can't verify that >> this code is actually safe to run, so I've made sure that it is safe >> myself and I'll let the compiler know what code to ignore during >> compilation." > > This is not like `unsafe` in Rust. We have some SHA-1 functions that > are safe (the default ones) that use SHA-1-DC to detect collisions. > People may also compile their Git version with a faster version of SHA-1 > that doesn't detect collisions and that may use hardware acceleration in > cases where we're not dealing with untrusted data. Taylor benchmarked > it and got some pretty nice performance improvements. > > My preference personally was to simply say, "SHA-1 is slow since it's > insecure; use SHA-256 if you want hardware acceleration and good > performance," but my advice was not heeded. > > So this allows us to do something like `assert!(hash.is_safe())` in > certain code where we know we have untrusted data to make sure we > haven't been passed a Hasher that has been incorrectly initialized. We > have some code paths which can accept either (and, depending on which > mode they're operating in, do or don't need a safe hasher), so separate > types are less convenient. We could do that, however, but it would make > things more complicated and we'd need a trait that covers both. > -- > brian m. carlson (they/them) > Toronto, Ontario, CA > <signature.asc> Given the confusion on the names, perhaps some docs in the code would help? Or maybe it's already doc'd over by the FFI type, in which case a note may suffice: "Safe" here is about the hashing algorithm and (un)trusted data, not Rust memory safety. See XYZ for more details. ^ permalink raw reply [flat|nested] 101+ messages in thread
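Ben's suggested note could land as a doc comment on the struct itself; a sketch only, with wording adapted from his message:

/// A streaming hasher over Git's C hash-context functions.
///
/// "Safe" here is about the hashing algorithm and (un)trusted data, not
/// Rust memory safety: the safe SHA-1 implementation uses SHA-1-DC to
/// detect collisions, while the unsafe one may trade that for speed.
pub struct Hasher {
    algo: HashAlgorithm,
    safe: bool,
    ctx: *mut c_void,
}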
* [PATCH 12/14] rust: add a new binary loose object map format 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (10 preceding siblings ...) 2025-10-27 0:44 ` [PATCH 11/14] rust: add functionality to hash an object brian m. carlson @ 2025-10-27 0:44 ` brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt ` (2 more replies) 2025-10-27 0:44 ` [PATCH 13/14] rust: add a small wrapper around the hashfile code brian m. carlson ` (5 subsequent siblings) 17 siblings, 3 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-27 0:44 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren Our current loose object format has a few problems. First, it is not efficient: the list of object IDs is not sorted and even if it were, there would not be an efficient way to look up objects in both algorithms. Second, we need to store mappings for things which are not technically loose objects but are not packed objects, either, and so cannot be stored in a pack index. These kinds of things include shallows, their parents, and their trees, as well as submodules. Yet we also need to implement a sensible way to store the kind of object so that we can prune unneeded entries. For instance, if the user has updated the shallows, we can remove the old values. For these reasons, introduce a new binary loose object map format. The careful reader will notice that it resembles very closely the pack index v3 format. Add an in-memory loose object map as well, and allow enabling writing to a batched map, which can then be written later as one of the binary loose object maps. Include several tests for round tripping and data lookup across algorithms. Note that the use of this code elsewhere in Git will involve some C code and some C-compatible code in Rust that will be introduced in a future commit. Thus, for example, we ignore the fact that if there is no current batch and the caller asks for data to be written, this code does nothing, mostly because this code also does not involve itself with opening or manipulating files. The C code that we will add later will implement this functionality at a higher level and take care of this, since the code which is necessary for writing to the object store is deeply involved with our C abstractions and it would require extensive work (which would not be especially valuable at this point) to port those to Rust. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Documentation/gitformat-loose.adoc | 104 ++++ Makefile | 1 + src/lib.rs | 1 + src/loose.rs | 912 +++++++++++++++++++++++++++++ src/meson.build | 1 + 5 files changed, 1019 insertions(+) create mode 100644 src/loose.rs diff --git a/Documentation/gitformat-loose.adoc b/Documentation/gitformat-loose.adoc index 947993663e..4850c91669 100644 --- a/Documentation/gitformat-loose.adoc +++ b/Documentation/gitformat-loose.adoc @@ -10,6 +10,8 @@ SYNOPSIS -------- [verse] $GIT_DIR/objects/[0-9a-f][0-9a-f]/* +$GIT_DIR/objects/loose-object-idx +$GIT_DIR/objects/loose-map/map-*.map DESCRIPTION ----------- @@ -48,6 +50,108 @@ stored under Similarly, a blob containing the contents `abc` would have the uncompressed data of `blob 3\0abc`. +== Loose object mapping + +When the `compatObjectFormat` option is used, Git needs to store a mapping +between the repository's main algorithm and the compatibility algorithm. There +are two formats for this: the legacy mapping and the modern mapping. 
+ +=== Legacy mapping + +The compatibility mapping is stored in a file called +`$GIT_DIR/objects/loose-object-idx`. The format of this file looks like this: + + # loose-object-idx + (main-name SP compat-name LF)* + +`main-name` refers to the hexadecimal object ID of the object in the main +repository format and `compat-name` refers to the same thing, but for the +compatibility format. + +This format is read if it exists but is not written. + +Note that carriage returns are not permitted in this file, regardless of the +host system or configuration. + +=== Modern mapping + +The modern mapping consists of a set of files under `$GIT_DIR/objects/loose-map` +ending in `.map`. The portion of the filename before the extension is the +hash checksum in hex format. + +`git pack-objects` will repack existing entries into one file, removing any +unnecessary objects, such as obsolete shallow entries or loose objects that +have been packed. + +==== Mapping file format + +- A header appears at the beginning and consists of the following: + * A 4-byte mapping signature: `LMAP` + * 4-byte version number: 1 + * 4-byte length of the header section. + * 4-byte number of objects declared in this map file. + * 4-byte number of object formats declared in this map file. + * For each object format: + ** 4-byte format identifier (e.g., `sha1` for SHA-1) + ** 4-byte length in bytes of shortened object names. This is the + shortest possible length needed to make names in the shortened + object name table unambiguous. + ** 8-byte integer, recording where tables relating to this format + are stored in this index file, as an offset from the beginning. + * 8-byte offset to the trailer from the beginning of this file. + * Zero or more additional key/value pairs (4-byte key, 4-byte value), which + may optionally declare one or more chunks. No chunks are currently + defined. Readers must ignore unrecognized keys. +- Zero or more NUL bytes. These are used to improve the alignment of the + 4-byte quantities below. +- Tables for the first object format: + * A sorted table of shortened object names. These are prefixes of the names + of all objects in this file, packed together without offset values to + reduce the cache footprint of the binary search for a specific object name. + * A sorted table of full object names. + * A table of 4-byte metadata values. + * Zero or more chunks. A chunk starts with a four-byte chunk identifier and + a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte + size (not including the identifier, parameter, or size), plus the chunk + data. +- Zero or more NUL bytes. +- Tables for subsequent object formats: + * A sorted table of shortened object names. These are prefixes of the names + of all objects in this file, packed together without offset values to + reduce the cache footprint of the binary search for a specific object name. + * A table of full object names in the order specified by the first object format. + * A table of 4-byte values mapping object name order to the order of the + first object format. For an object in the table of sorted shortened object + names, the value at the corresponding index in this table is the index in + the previous table for that same object. + * Zero or more NUL bytes. +- The trailer consists of the following: + * Hash checksum of all of the above. + +The lower six bits of each metadata value contain a type field indicating the +reason that this object is stored: + +0:: + Reserved.
+1:: + This object is stored as a loose object in the repository. +2:: + This object is a shallow entry. The mapping refers to a shallow value + returned by a remote server. +3:: + This object is a submodule entry. The mapping refers to the commit stored + representing a submodule. + +Other data may be stored in this field in the future. Bits that are not used +must be zero. + +All 4-byte numbers are in network order and must be 4-byte aligned in the file, +so the NUL padding may be required in some cases. + +Note that the hash at the end of the file is in whatever the repository's main +algorithm is. In the usual case when there are multiple algorithms, the main +algorithm will be SHA-256 and the compatibility algorithm will be SHA-1. + GIT --- Part of the linkgit:git[1] suite diff --git a/Makefile b/Makefile index 7c36302717..2081b13780 100644 --- a/Makefile +++ b/Makefile @@ -1523,6 +1523,7 @@ UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o RUST_SOURCES += src/hash.rs RUST_SOURCES += src/lib.rs +RUST_SOURCES += src/loose.rs RUST_SOURCES += src/varint.rs GIT-VERSION-FILE: FORCE diff --git a/src/lib.rs b/src/lib.rs index cf7c962509..442f9433dc 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1,2 +1,3 @@ pub mod hash; +pub mod loose; pub mod varint; diff --git a/src/loose.rs b/src/loose.rs new file mode 100644 index 0000000000..a4e7d2fa48 --- /dev/null +++ b/src/loose.rs @@ -0,0 +1,912 @@ +// This program is free software; you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation: version 2 of the License, dated June 1991. +// +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License along +// with this program; if not, see <https://www.gnu.org/licenses/>. + +use crate::hash::{HashAlgorithm, ObjectID, GIT_MAX_RAWSZ}; +use std::collections::BTreeMap; +use std::convert::TryInto; +use std::io::{self, Write}; + +/// The type of object stored in the map. +/// +/// If this value is `Reserved`, then it is never written to disk and is used primarily to store +/// certain hard-coded objects, like the empty tree, empty blob, or null object ID. +/// +/// If this value is `LooseObject`, then this represents a loose object. `Shallow` represents a +/// shallow commit, its parent, or its tree. `Submodule` represents a submodule commit. +#[repr(C)] +#[derive(Debug, Clone, Copy, Ord, PartialOrd, Eq, PartialEq)] +pub enum MapType { + Reserved = 0, + LooseObject = 1, + Shallow = 2, + Submodule = 3, +} + +impl MapType { + pub fn from_u32(n: u32) -> Option<MapType> { + match n { + 0 => Some(Self::Reserved), + 1 => Some(Self::LooseObject), + 2 => Some(Self::Shallow), + 3 => Some(Self::Submodule), + _ => None, + } + } +} + +/// The value of an object stored in a `LooseObjectMemoryMap`. +/// +/// This keeps the object ID to which the key is mapped and its kind together. +struct MappedObject { + oid: ObjectID, + kind: MapType, +} + +/// Memory storage for a loose object. +struct LooseObjectMemoryMap { + to_compat: BTreeMap<ObjectID, MappedObject>, + to_storage: BTreeMap<ObjectID, MappedObject>, + compat: HashAlgorithm, + storage: HashAlgorithm, +} + +impl LooseObjectMemoryMap { + /// Create a new `LooseObjectMemoryMap`. 
+ /// + /// The storage and compatibility `HashAlgorithm` instances are used to store the object IDs in + /// the correct map. + fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> LooseObjectMemoryMap { + LooseObjectMemoryMap { + to_compat: BTreeMap::new(), + to_storage: BTreeMap::new(), + compat, + storage, + } + } + + fn len(&self) -> usize { + self.to_compat.len() + } + + /// Write this map to an interface implementing `std::io::Write`. + fn write<W: Write>(&self, wrtr: W) -> io::Result<()> { + const VERSION_NUMBER: u32 = 1; + const NUM_OBJECT_FORMATS: u32 = 2; + const PADDING: [u8; 4] = [0u8; 4]; + + let mut wrtr = wrtr; + let header_size: u32 = 4 + 4 + 4 + 4 + 4 + (4 + 4 + 8) * 2 + 8; + + wrtr.write_all(b"LMAP")?; + wrtr.write_all(&VERSION_NUMBER.to_be_bytes())?; + wrtr.write_all(&header_size.to_be_bytes())?; + wrtr.write_all(&(self.to_compat.len() as u32).to_be_bytes())?; + wrtr.write_all(&NUM_OBJECT_FORMATS.to_be_bytes())?; + + let storage_short_len = self.find_short_name_len(&self.to_compat, self.storage); + let compat_short_len = self.find_short_name_len(&self.to_storage, self.compat); + + let storage_npadding = Self::required_nul_padding(self.to_compat.len(), storage_short_len); + let compat_npadding = Self::required_nul_padding(self.to_compat.len(), compat_short_len); + + let mut offset: u64 = header_size as u64; + + for (algo, len, npadding) in &[ + (self.storage, storage_short_len, storage_npadding), + (self.compat, compat_short_len, compat_npadding), + ] { + wrtr.write_all(&algo.format_id().to_be_bytes())?; + wrtr.write_all(&(*len as u32).to_be_bytes())?; + + offset += *npadding; + wrtr.write_all(&offset.to_be_bytes())?; + + offset += self.to_compat.len() as u64 * (*len as u64 + algo.raw_len() as u64 + 4); + } + + wrtr.write_all(&offset.to_be_bytes())?; + + let order_map: BTreeMap<&ObjectID, usize> = self + .to_compat + .keys() + .enumerate() + .map(|(i, oid)| (oid, i)) + .collect(); + + wrtr.write_all(&PADDING[0..storage_npadding as usize])?; + for oid in self.to_compat.keys() { + wrtr.write_all(&oid.as_slice()[0..storage_short_len])?; + } + for oid in self.to_compat.keys() { + wrtr.write_all(oid.as_slice())?; + } + for meta in self.to_compat.values() { + wrtr.write_all(&(meta.kind as u32).to_be_bytes())?; + } + + wrtr.write_all(&PADDING[0..compat_npadding as usize])?; + for oid in self.to_storage.keys() { + wrtr.write_all(&oid.as_slice()[0..compat_short_len])?; + } + for meta in self.to_compat.values() { + wrtr.write_all(meta.oid.as_slice())?; + } + for meta in self.to_storage.values() { + wrtr.write_all(&(order_map[&meta.oid] as u32).to_be_bytes())?; + } + + Ok(()) + } + + fn required_nul_padding(nitems: usize, short_len: usize) -> u64 { + let shortened_table_len = nitems as u64 * short_len as u64; + let misalignment = shortened_table_len & 3; + // If the value is 0, return 0; otherwise, return the difference from 4. 
+ (4 - misalignment) & 3 + } + + fn last_matching_offset(a: &ObjectID, b: &ObjectID, algop: HashAlgorithm) -> usize { + for i in 0..=algop.raw_len() { + if a.hash[i] != b.hash[i] { + return i; + } + } + algop.raw_len() + } + + fn find_short_name_len( + &self, + map: &BTreeMap<ObjectID, MappedObject>, + algop: HashAlgorithm, + ) -> usize { + if map.len() <= 1 { + return 1; + } + let mut len = 1; + let mut iter = map.keys(); + let mut cur = match iter.next() { + Some(cur) => cur, + None => return len, + }; + for item in iter { + let offset = Self::last_matching_offset(cur, item, algop); + if offset >= len { + len = offset + 1; + } + cur = item; + } + if len > algop.raw_len() { + algop.raw_len() + } else { + len + } + } +} + +struct ObjectFormatData { + data_off: usize, + shortened_len: usize, + full_off: usize, + mapping_off: Option<usize>, +} + +pub struct MmapedLooseObjectMapIter<'a> { + offset: usize, + algos: Vec<HashAlgorithm>, + source: &'a MmapedLooseObjectMap<'a>, +} + +impl<'a> Iterator for MmapedLooseObjectMapIter<'a> { + type Item = Vec<ObjectID>; + + fn next(&mut self) -> Option<Self::Item> { + if self.offset >= self.source.nitems { + return None; + } + let offset = self.offset; + self.offset += 1; + let v: Vec<ObjectID> = self + .algos + .iter() + .cloned() + .filter_map(|algo| self.source.oid_from_offset(offset, algo)) + .collect(); + if v.len() != self.algos.len() { + return None; + } + Some(v) + } +} + +#[allow(dead_code)] +pub struct MmapedLooseObjectMap<'a> { + memory: &'a [u8], + nitems: usize, + meta_off: usize, + obj_formats: BTreeMap<HashAlgorithm, ObjectFormatData>, + main_algo: HashAlgorithm, +} + +#[derive(Debug)] +#[allow(dead_code)] +enum MmapedParseError { + HeaderTooSmall, + InvalidSignature, + InvalidVersion, + UnknownAlgorithm, + OffsetTooLarge, + TooFewObjectFormats, + UnalignedData, + InvalidTrailerOffset, +} + +#[allow(dead_code)] +impl<'a> MmapedLooseObjectMap<'a> { + fn new( + slice: &'a [u8], + hash_algo: HashAlgorithm, + ) -> Result<MmapedLooseObjectMap<'a>, MmapedParseError> { + let object_format_header_size = 4 + 4 + 8; + let trailer_offset_size = 8; + let header_size: usize = + 4 + 4 + 4 + 4 + 4 + object_format_header_size * 2 + trailer_offset_size; + if slice.len() < header_size { + return Err(MmapedParseError::HeaderTooSmall); + } + if slice[0..4] != *b"LMAP" { + return Err(MmapedParseError::InvalidSignature); + } + if Self::u32_at_offset(slice, 4) != 1 { + return Err(MmapedParseError::InvalidVersion); + } + let _ = Self::u32_at_offset(slice, 8) as usize; + let nitems = Self::u32_at_offset(slice, 12) as usize; + let nobj_formats = Self::u32_at_offset(slice, 16) as usize; + if nobj_formats < 2 { + return Err(MmapedParseError::TooFewObjectFormats); + } + let mut offset = 20; + let mut meta_off = None; + let mut data = BTreeMap::new(); + for i in 0..nobj_formats { + if offset + object_format_header_size + trailer_offset_size > slice.len() { + return Err(MmapedParseError::HeaderTooSmall); + } + let format_id = Self::u32_at_offset(slice, offset); + let shortened_len = Self::u32_at_offset(slice, offset + 4) as usize; + let data_off = Self::u64_at_offset(slice, offset + 8); + + let algo = HashAlgorithm::from_format_id(format_id) + .ok_or(MmapedParseError::UnknownAlgorithm)?; + let data_off: usize = data_off + .try_into() + .map_err(|_| MmapedParseError::OffsetTooLarge)?; + + // Every object format must have these entries. 
+ let shortened_table_len = shortened_len + .checked_mul(nitems) + .ok_or(MmapedParseError::OffsetTooLarge)?; + let full_off = data_off + .checked_add(shortened_table_len) + .ok_or(MmapedParseError::OffsetTooLarge)?; + Self::verify_aligned(full_off)?; + Self::verify_valid(slice, full_off as u64)?; + + let full_length = algo + .raw_len() + .checked_mul(nitems) + .ok_or(MmapedParseError::OffsetTooLarge)?; + let off = full_length + .checked_add(full_off) + .ok_or(MmapedParseError::OffsetTooLarge)?; + Self::verify_aligned(off)?; + Self::verify_valid(slice, off as u64)?; + + // This is for the metadata for the first object format and for the order mapping for + // other object formats. + let meta_size = nitems + .checked_mul(4) + .ok_or(MmapedParseError::OffsetTooLarge)?; + let meta_end = off + .checked_add(meta_size) + .ok_or(MmapedParseError::OffsetTooLarge)?; + Self::verify_valid(slice, meta_end as u64)?; + + let mut mapping_off = None; + if i == 0 { + meta_off = Some(off); + } else { + mapping_off = Some(off); + } + + data.insert( + algo, + ObjectFormatData { + data_off, + shortened_len, + full_off, + mapping_off, + }, + ); + offset += object_format_header_size; + } + let trailer = Self::u64_at_offset(slice, offset); + Self::verify_aligned(trailer as usize)?; + Self::verify_valid(slice, trailer)?; + let end = trailer + .checked_add(hash_algo.raw_len() as u64) + .ok_or(MmapedParseError::OffsetTooLarge)?; + if end != slice.len() as u64 { + return Err(MmapedParseError::InvalidTrailerOffset); + } + match meta_off { + Some(meta_off) => Ok(MmapedLooseObjectMap { + memory: slice, + nitems, + meta_off, + obj_formats: data, + main_algo: hash_algo, + }), + None => Err(MmapedParseError::TooFewObjectFormats), + } + } + + fn iter(&self) -> MmapedLooseObjectMapIter<'_> { + let mut algos = Vec::with_capacity(self.obj_formats.len()); + algos.push(self.main_algo); + for algo in self.obj_formats.keys().cloned() { + if algo != self.main_algo { + algos.push(algo); + } + } + MmapedLooseObjectMapIter { + offset: 0, + algos, + source: self, + } + } + + /// Treats `sl` as if it were a set of slices of `wanted.len()` bytes, and searches for + /// `wanted` within it. + /// + /// If found, returns the offset of the subslice in `sl`. + /// + /// ``` + /// let sl = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]; + /// + /// assert_eq!(MmapedLooseObjectMap::binary_search_slice(sl, &[2, 3]), Some(1)); + /// assert_eq!(MmapedLooseObjectMap::binary_search_slice(sl, &[6, 7]), Some(4)); + /// assert_eq!(MmapedLooseObjectMap::binary_search_slice(sl, &[1, 2]), None); + /// assert_eq!(MmapedLooseObjectMap::binary_search_slice(sl, &[10, 20]), None); + /// ``` + fn binary_search_slice(sl: &[u8], wanted: &[u8]) -> Option<usize> { + let len = wanted.len(); + let res = sl.binary_search_by(|item| { + // We would like element_offset, but that is currently nightly only. Instead, do a + // pointer subtraction to find the index. + let index = unsafe { (item as *const u8).offset_from(sl.as_ptr()) } as usize; + // Now we have the index of this object. Round it down to the nearest full-sized + // chunk to find the actual offset where this starts. + let index = index - (index % len); + // Compute the comparison of that value instead, which will provide the expected + // result. + sl[index..index + wanted.len()].cmp(wanted) + }); + res.ok().map(|offset| offset / len) + } + + /// Look up `oid` in the map in order to convert it to `algo`. + /// + /// If this object is in the map, return the offset in the table for the main algorithm. 
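+    ///
+    /// The lookup binary-searches the shortened-name table for the object ID's algorithm,
+    /// follows the order-mapping table when the ID is in a compatibility algorithm, and
+    /// verifies the full object ID before returning, so a shortened-prefix match alone is
+    /// never trusted.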
+ fn look_up_object(&self, oid: &ObjectID) -> Option<usize> { + let oid_algo = HashAlgorithm::from_u32(oid.algo)?; + let params = self.obj_formats.get(&oid_algo)?; + let short_table = + &self.memory[params.data_off..params.data_off + (params.shortened_len * self.nitems)]; + let index = + Self::binary_search_slice(short_table, &oid.as_slice()[0..params.shortened_len])?; + match params.mapping_off { + Some(from_off) => { + // oid is in a compatibility algorithm. Find the mapping index. + let mapped = Self::u32_at_offset(self.memory, from_off + index * 4) as usize; + if mapped >= self.nitems { + return None; + } + let oid_offset = params.full_off + mapped * oid_algo.raw_len(); + if self.memory[oid_offset..oid_offset + oid_algo.raw_len()] != *oid.as_slice() { + return None; + } + Some(mapped) + } + None => { + // oid is in the main algorithm. Find the object ID in the main map to confirm + // it's correct. + let oid_offset = params.full_off + index * oid_algo.raw_len(); + if self.memory[oid_offset..oid_offset + oid_algo.raw_len()] != *oid.as_slice() { + return None; + } + Some(index) + } + } + } + + #[allow(dead_code)] + fn map_object(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<MappedObject> { + let main = self.look_up_object(oid)?; + let meta = MapType::from_u32(Self::u32_at_offset(self.memory, self.meta_off + (main * 4)))?; + Some(MappedObject { + oid: self.oid_from_offset(main, algo)?, + kind: meta, + }) + } + + fn map_oid(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<ObjectID> { + if algo as u32 == oid.algo { + return Some(oid.clone()); + } + + let main = self.look_up_object(oid)?; + self.oid_from_offset(main, algo) + } + + fn oid_from_offset(&self, offset: usize, algo: HashAlgorithm) -> Option<ObjectID> { + let aparams = self.obj_formats.get(&algo)?; + + let mut hash = [0u8; GIT_MAX_RAWSZ]; + let len = algo.raw_len(); + let oid_off = aparams.full_off + (offset * len); + hash[0..len].copy_from_slice(&self.memory[oid_off..oid_off + len]); + Some(ObjectID { + hash, + algo: algo as u32, + }) + } + + fn u32_at_offset(slice: &[u8], offset: usize) -> u32 { + u32::from_be_bytes(slice[offset..offset + 4].try_into().unwrap()) + } + + fn u64_at_offset(slice: &[u8], offset: usize) -> u64 { + u64::from_be_bytes(slice[offset..offset + 8].try_into().unwrap()) + } + + fn verify_aligned(offset: usize) -> Result<(), MmapedParseError> { + if (offset & 3) != 0 { + return Err(MmapedParseError::UnalignedData); + } + Ok(()) + } + + fn verify_valid(slice: &[u8], offset: u64) -> Result<(), MmapedParseError> { + if offset >= slice.len() as u64 { + return Err(MmapedParseError::OffsetTooLarge); + } + Ok(()) + } +} + +/// A map for loose and other non-packed object IDs that maps between a storage and compatibility +/// mapping. +/// +/// In addition to the in-memory option, there is an optional batched storage, which can be used to +/// write objects to disk in an efficient way. +pub struct LooseObjectMap { + mem: LooseObjectMemoryMap, + batch: Option<LooseObjectMemoryMap>, +} + +impl LooseObjectMap { + /// Create a new `LooseObjectMap` with the given hash algorithms. + /// + /// This initializes the memory map to automatically map the empty tree, empty blob, and null + /// object ID. 
+    pub fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> LooseObjectMap {
+        let mut map = LooseObjectMemoryMap::new(storage, compat);
+        for (main, compat) in &[
+            (storage.empty_tree(), compat.empty_tree()),
+            (storage.empty_blob(), compat.empty_blob()),
+            (storage.null_oid(), compat.null_oid()),
+        ] {
+            map.to_storage.insert(
+                (*compat).clone(),
+                MappedObject {
+                    oid: (*main).clone(),
+                    kind: MapType::Reserved,
+                },
+            );
+            map.to_compat.insert(
+                (*main).clone(),
+                MappedObject {
+                    oid: (*compat).clone(),
+                    kind: MapType::Reserved,
+                },
+            );
+        }
+        LooseObjectMap {
+            mem: map,
+            batch: None,
+        }
+    }
+
+    pub fn hash_algo(&self) -> HashAlgorithm {
+        self.mem.storage
+    }
+
+    /// Start a batch for efficient writing.
+    ///
+    /// If there is already a batch started, this does nothing and the existing batch is retained.
+    pub fn start_batch(&mut self) {
+        if self.batch.is_none() {
+            self.batch = Some(LooseObjectMemoryMap::new(self.mem.storage, self.mem.compat));
+        }
+    }
+
+    pub fn batch_len(&self) -> Option<usize> {
+        self.batch.as_ref().map(|b| b.len())
+    }
+
+    /// If a batch exists, write it to the writer.
+    pub fn finish_batch<W: Write>(&mut self, w: W) -> io::Result<()> {
+        if let Some(txn) = self.batch.take() {
+            txn.write(w)?;
+        }
+        Ok(())
+    }
+
+    /// If a batch exists, discard it without writing it anywhere.
+    pub fn abort_batch(&mut self) {
+        self.batch = None;
+    }
+
+    /// Return whether there is a batch already started.
+    ///
+    /// If you just want a batch to exist and don't care whether one has already been started, you
+    /// may simply call `start_batch` unconditionally.
+    pub fn has_batch(&self) -> bool {
+        self.batch.is_some()
+    }
+
+    /// Insert an object into the map.
+    ///
+    /// If `write` is true and there is a batch started, write the object into the batch as well as
+    /// into the memory map.
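+    ///
+    /// The two object IDs may be passed in either order; the implementation checks which one
+    /// uses the compatibility algorithm and files each ID into the appropriate map.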
+ pub fn insert(&mut self, oid1: &ObjectID, oid2: &ObjectID, kind: MapType, write: bool) { + let (compat_oid, storage_oid) = + if HashAlgorithm::from_u32(oid1.algo) == Some(self.mem.compat) { + (oid1, oid2) + } else { + (oid2, oid1) + }; + Self::insert_into(&mut self.mem, storage_oid, compat_oid, kind); + if write { + if let Some(ref mut batch) = self.batch { + Self::insert_into(batch, storage_oid, compat_oid, kind); + } + } + } + + fn insert_into( + map: &mut LooseObjectMemoryMap, + storage: &ObjectID, + compat: &ObjectID, + kind: MapType, + ) { + map.to_compat.insert( + storage.clone(), + MappedObject { + oid: compat.clone(), + kind, + }, + ); + map.to_storage.insert( + compat.clone(), + MappedObject { + oid: storage.clone(), + kind, + }, + ); + } + + #[allow(dead_code)] + fn map_object(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<&MappedObject> { + let map = if algo == self.mem.storage { + &self.mem.to_storage + } else { + &self.mem.to_compat + }; + map.get(oid) + } + + #[allow(dead_code)] + fn map_oid<'a, 'b: 'a>( + &'b self, + oid: &'a ObjectID, + algo: HashAlgorithm, + ) -> Option<&'a ObjectID> { + if algo as u32 == oid.algo { + return Some(oid); + } + let entry = self.map_object(oid, algo); + entry.map(|obj| &obj.oid) + } +} + +#[cfg(test)] +mod tests { + use super::{LooseObjectMap, LooseObjectMemoryMap, MapType, MmapedLooseObjectMap}; + use crate::hash::{HashAlgorithm, Hasher, ObjectID}; + use std::convert::TryInto; + use std::io::{self, Cursor, Write}; + + struct TrailingWriter { + curs: Cursor<Vec<u8>>, + hasher: Hasher, + } + + impl TrailingWriter { + fn new() -> TrailingWriter { + TrailingWriter { + curs: Cursor::new(Vec::new()), + hasher: Hasher::new(HashAlgorithm::SHA256), + } + } + + fn finalize(mut self) -> Vec<u8> { + let _ = self.hasher.flush(); + let mut v = self.curs.into_inner(); + v.extend(self.hasher.into_vec()); + v + } + } + + impl Write for TrailingWriter { + fn write(&mut self, data: &[u8]) -> io::Result<usize> { + self.hasher.write_all(data)?; + self.curs.write_all(data)?; + Ok(data.len()) + } + + fn flush(&mut self) -> io::Result<()> { + self.hasher.flush()?; + self.curs.flush()?; + Ok(()) + } + } + + fn sha1_oid(b: &[u8]) -> ObjectID { + assert_eq!(b.len(), 20); + let mut data = [0u8; 32]; + data[0..20].copy_from_slice(b); + ObjectID { + hash: data, + algo: HashAlgorithm::SHA1 as u32, + } + } + + fn sha256_oid(b: &[u8]) -> ObjectID { + assert_eq!(b.len(), 32); + ObjectID { + hash: b.try_into().unwrap(), + algo: HashAlgorithm::SHA256 as u32, + } + } + + fn test_entries() -> &'static [(&'static str, &'static [u8], &'static [u8], MapType, bool)] { + // These are all example blobs containing the content in the first argument. 
+ &[ + ("abc", b"\xf2\xba\x8f\x84\xab\x5c\x1b\xce\x84\xa7\xb4\x41\xcb\x19\x59\xcf\xc7\x09\x3b\x7f", b"\xc1\xcf\x6e\x46\x50\x77\x93\x0e\x88\xdc\x51\x36\x64\x1d\x40\x2f\x72\xa2\x29\xdd\xd9\x96\xf6\x27\xd6\x0e\x96\x39\xea\xba\x35\xa6", MapType::LooseObject, false), + ("def", b"\x0c\x00\x38\x32\xe7\xbf\xa9\xca\x8b\x5c\x20\x35\xc9\xbd\x68\x4a\x5f\x26\x23\xbc", b"\x8a\x90\x17\x26\x48\x4d\xb0\xf2\x27\x9f\x30\x8d\x58\x96\xd9\x6b\xf6\x3a\xd6\xde\x95\x7c\xa3\x8a\xdc\x33\x61\x68\x03\x6e\xf6\x63", MapType::Shallow, true), + ("ghi", b"\x45\xa8\x2e\x29\x5c\x52\x47\x31\x14\xc5\x7c\x18\xf4\xf5\x23\x68\xdf\x2a\x3c\xfd", b"\x6e\x47\x4c\x74\xf5\xd7\x78\x14\xc7\xf7\xf0\x7c\x37\x80\x07\x90\x53\x42\xaf\x42\x81\xe6\x86\x8d\x33\x46\x45\x4b\xb8\x63\xab\xc3", MapType::Submodule, false), + ("jkl", b"\x45\x32\x8c\x36\xff\x2e\x9b\x9b\x4e\x59\x2c\x84\x7d\x3f\x9a\x7f\xd9\xb3\xe7\x16", b"\xc3\xee\xf7\x54\xa2\x1e\xc6\x9d\x43\x75\xbe\x6f\x18\x47\x89\xa8\x11\x6f\xd9\x66\xfc\x67\xdc\x31\xd2\x11\x15\x42\xc8\xd5\xa0\xaf", MapType::LooseObject, true), + ] + } + + fn test_map(write_all: bool) -> Box<LooseObjectMap> { + let mut map = Box::new(LooseObjectMap::new( + HashAlgorithm::SHA256, + HashAlgorithm::SHA1, + )); + + map.start_batch(); + + for (_blob_content, sha1, sha256, kind, swap) in test_entries() { + let s256 = sha256_oid(sha256); + let s1 = sha1_oid(sha1); + let write = write_all || (*kind as u32 & 2) == 0; + if *swap { + // Insert the item into the batch arbitrarily based on the type. This tests that + // we can specify either order and we'll do the right thing. + map.insert(&s256, &s1, *kind, write); + } else { + map.insert(&s1, &s256, *kind, write); + } + } + + map + } + + #[test] + fn can_read_and_write_format() { + for full in &[true, false] { + let mut map = test_map(*full); + let mut wrtr = TrailingWriter::new(); + map.finish_batch(&mut wrtr).unwrap(); + + assert_eq!(map.has_batch(), false); + + let data = wrtr.finalize(); + MmapedLooseObjectMap::new(&data, HashAlgorithm::SHA256).unwrap(); + } + } + + #[test] + fn looks_up_from_mmaped() { + let mut map = test_map(true); + let mut wrtr = TrailingWriter::new(); + map.finish_batch(&mut wrtr).unwrap(); + + assert_eq!(map.has_batch(), false); + + let data = wrtr.finalize(); + let entries = test_entries(); + let map = MmapedLooseObjectMap::new(&data, HashAlgorithm::SHA256).unwrap(); + + for (_, sha1, sha256, kind, _) in entries { + let s256 = sha256_oid(sha256); + let s1 = sha1_oid(sha1); + + let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res.oid, s1); + assert_eq!(res.kind, *kind); + let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res, s1); + + let res = map.map_object(&s256, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res.oid, s256); + assert_eq!(res.kind, *kind); + let res = map.map_oid(&s256, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res, s256); + + let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res.oid, s256); + assert_eq!(res.kind, *kind); + let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res, s256); + + let res = map.map_object(&s1, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res.oid, s1); + assert_eq!(res.kind, *kind); + let res = map.map_oid(&s1, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res, s1); + } + + for octet in &[0x00u8, 0x6d, 0x6e, 0x8a, 0xff] { + let missing_oid = ObjectID { + hash: [*octet; 32], + algo: HashAlgorithm::SHA256 as u32, + }; + + assert!(map.map_object(&missing_oid, HashAlgorithm::SHA1).is_none()); + 
assert!(map.map_oid(&missing_oid, HashAlgorithm::SHA1).is_none()); + + assert_eq!( + map.map_oid(&missing_oid, HashAlgorithm::SHA256).unwrap(), + missing_oid + ); + } + } + + #[test] + fn binary_searches_slices_correctly() { + let sl = &[ + 0, 1, 2, 15, 14, 13, 18, 10, 2, 20, 20, 20, 21, 21, 0, 21, 21, 1, 21, 21, 21, 21, 21, + 22, 22, 23, 24, + ]; + + let expected: &[(&[u8], Option<usize>)] = &[ + (&[0, 1, 2], Some(0)), + (&[15, 14, 13], Some(1)), + (&[18, 10, 2], Some(2)), + (&[20, 20, 20], Some(3)), + (&[21, 21, 0], Some(4)), + (&[21, 21, 1], Some(5)), + (&[21, 21, 21], Some(6)), + (&[21, 21, 22], Some(7)), + (&[22, 23, 24], Some(8)), + (&[2, 15, 14], None), + (&[0, 21, 21], None), + (&[21, 21, 23], None), + (&[22, 22, 23], None), + (&[0xff, 0xff, 0xff], None), + (&[0, 0, 0], None), + ]; + + for (wanted, value) in expected { + assert_eq!( + MmapedLooseObjectMap::binary_search_slice(sl, wanted), + *value + ); + } + } + + #[test] + fn looks_up_oid_correctly() { + let map = test_map(false); + let entries = test_entries(); + + let s256 = sha256_oid(entries[0].2); + let s1 = sha1_oid(entries[0].1); + + let missing_oid = ObjectID { + hash: [0xffu8; 32], + algo: HashAlgorithm::SHA256 as u32, + }; + + let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res.oid, s1); + assert_eq!(res.kind, MapType::LooseObject); + let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(*res, s1); + + let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res.oid, s256); + assert_eq!(res.kind, MapType::LooseObject); + let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(*res, s256); + + assert!(map.map_object(&missing_oid, HashAlgorithm::SHA1).is_none()); + assert!(map.map_oid(&missing_oid, HashAlgorithm::SHA1).is_none()); + + assert_eq!( + *map.map_oid(&missing_oid, HashAlgorithm::SHA256).unwrap(), + missing_oid + ); + } + + #[test] + fn looks_up_known_oids_correctly() { + let map = test_map(false); + + let funcs: &[&dyn Fn(HashAlgorithm) -> &'static ObjectID] = &[ + &|h: HashAlgorithm| h.empty_tree(), + &|h: HashAlgorithm| h.empty_blob(), + &|h: HashAlgorithm| h.null_oid(), + ]; + + for f in funcs { + let s256 = f(HashAlgorithm::SHA256); + let s1 = f(HashAlgorithm::SHA1); + + let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(res.oid, *s1); + assert_eq!(res.kind, MapType::Reserved); + let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap(); + assert_eq!(*res, *s1); + + let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(res.oid, *s256); + assert_eq!(res.kind, MapType::Reserved); + let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap(); + assert_eq!(*res, *s256); + } + } + + #[test] + fn nul_padding() { + assert_eq!(LooseObjectMemoryMap::required_nul_padding(1, 1), 3); + assert_eq!(LooseObjectMemoryMap::required_nul_padding(2, 1), 2); + assert_eq!(LooseObjectMemoryMap::required_nul_padding(3, 1), 1); + assert_eq!(LooseObjectMemoryMap::required_nul_padding(2, 2), 0); + + assert_eq!(LooseObjectMemoryMap::required_nul_padding(39, 3), 3); + } +} diff --git a/src/meson.build b/src/meson.build index c77041a3fa..1eea068519 100644 --- a/src/meson.build +++ b/src/meson.build @@ -1,6 +1,7 @@ libgit_rs_sources = [ 'hash.rs', 'lib.rs', + 'loose.rs', 'varint.rs', ] ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 12/14] rust: add a new binary loose object map format 2025-10-27 0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson @ 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-29 1:37 ` brian m. carlson 2025-10-29 17:03 ` Junio C Hamano 2025-10-29 18:21 ` Junio C Hamano 2 siblings, 1 reply; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-28 9:18 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Ezekiel Newren On Mon, Oct 27, 2025 at 12:44:02AM +0000, brian m. carlson wrote: > Our current loose object format has a few problems. First, it is not > efficient: the list of object IDs is not sorted and even if it were, > there would not be an efficient way to look up objects in both > algorithms. > > Second, we need to store mappings for things which are not technically > loose objects but are not packed objects, either, and so cannot be > stored in a pack index. These kinds of things include shallows, their > parents, and their trees, as well as submodules. Yet we also need to > implement a sensible way to store the kind of object so that we can > prune unneeded entries. For instance, if the user has updated the > shallows, we can remove the old values. Doesn't this indicate that calling this "loose object map" is kind of a misnomer? If we want to be able to store arbitrary objects regardless of the way those are stored (or not stored) in the ODB then I think it's overall quite confusing to have "loose" in the name. This isn't something we can fix for the old loose object map. But shouldn't we fix this now for the new format you're about to introduce? > For these reasons, introduce a new binary loose object map format. The > careful reader will notice that it resembles very closely the pack index > v3 format. Add an in-memory loose object map as well, and allow > enabling writing to a batched map, which can then be written later as > one of the binary loose object maps. Include several tests for round > tripping and data lookup across algorithms. s/enabling// > diff --git a/Documentation/gitformat-loose.adoc b/Documentation/gitformat-loose.adoc > index 947993663e..4850c91669 100644 > --- a/Documentation/gitformat-loose.adoc > +++ b/Documentation/gitformat-loose.adoc > @@ -48,6 +50,108 @@ stored under > Similarly, a blob containing the contents `abc` would have the uncompressed > data of `blob 3\0abc`. > > +== Loose object mapping > + > +When the `compatObjectFormat` option is used, Git needs to store a mapping > +between the repository's main algorithm and the compatibility algorithm. There > +are two formats for this: the legacy mapping and the modern mapping. > + > +=== Legacy mapping > + > +The compatibility mapping is stored in a file called > +`$GIT_DIR/objects/loose-object-idx`. The format of this file looks like this: > + > + # loose-object-idx > + (main-name SP compat-name LF)* > + > +`main-name` refers to hexadecimal object ID of the object in the main > +repository format and `compat-name` refers to the same thing, but for the > +compatibility format. > + > +This format is read if it exists but is not written. > + > +Note that carriage returns are not permitted in this file, regardless of the > +host system or configuration. As far as I understood, this legacy mapping wasn't really used anywhere as it is basically nonfunctional in the first place. Can we get away with dropping it altogether? > +=== Modern mapping > + > +The modern mapping consists of a set of files under `$GIT_DIR/objects/loose` > +ending in `.map`. 
The portion of the filename before the extension is that of
> +the hash checksum in hex format.

Given that we're talking about multiple different hashes: which hash
function is used for this checksum? I assume it's the main hash, but it
might be sensible to document this.

> +`git pack-objects` will repack existing entries into one file, removing any
> +unnecessary objects, such as obsolete shallow entries or loose objects that
> +have been packed.

Curious that this is put into git-pack-objects(1), as it doesn't quite
feel related to the task. Sure, it generates packfiles, but it doesn't
really handle the logic to manage loose objects/packfiles in the repo.
This feels closer to what git-repack(1) is doing, so would that be a
better place to put it?

> +==== Mapping file format
> +
> +- A header appears at the beginning and consists of the following:
> +  * A 4-byte mapping signature: `LMAP`
> +  * 4-byte version number: 1
> +  * 4-byte length of the header section.
> +  * 4-byte number of objects declared in this map file.
> +  * 4-byte number of object formats declared in this map file.
> +  * For each object format:
> +    ** 4-byte format identifier (e.g., `sha1` for SHA-1)
> +    ** 4-byte length in bytes of shortened object names. This is the
> +       shortest possible length needed to make names in the shortened
> +       object name table unambiguous.
> +    ** 8-byte integer, recording where tables relating to this format
> +       are stored in this index file, as an offset from the beginning.

As far as I understand this allows us to even store multiple
compatibility hashes if we were ever to grow a third hash. We would
still be able to binary-search through the file as we can compute the
size of every record with this header.

> +  * 8-byte offset to the trailer from the beginning of this file.
> +  * Zero or more additional key/value pairs (4-byte key, 4-byte value), which
> +    may optionally declare one or more chunks. No chunks are currently
> +    defined. Readers must ignore unrecognized keys.

How does the reader identify these key/value pairs and know how many of
those there are? Also, do you already have an idea what those should be
used for?

> +- Zero or more NUL bytes. These are used to improve the alignment of the
> +  4-byte quantities below.

How does one figure out how many NUL bytes there's going to be? I guess
the reader doesn't need to know as it simply uses the length of the
header section to seek to the tables?

> +- Tables for the first object format:
> +  * A sorted table of shortened object names. These are prefixes of the names
> +    of all objects in this file, packed together without offset values to
> +    reduce the cache footprint of the binary search for a specific object name.

Okay. The length of the shortened object names is encoded in the header,
so all of the shortened names have the same length.

Does the reader have a way to disambiguate the shortened object names?
They may be unambiguous at the point in time where the mapping is
written, but when they are being shortened it becomes plausible that the
object names become ambiguous at a later point in time.

> +  * A sorted table of full object names.

Ah, I see! We have a second table further down that encodes full object
names, so yes, we can fully disambiguate.

> +  * A table of 4-byte metadata values.
> +  * Zero or more chunks. A chunk starts with a four-byte chunk identifier and
> +    a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte
> +    size (not including the identifier, parameter, or size), plus the chunk
> +    data.
> +- Zero or more NUL bytes. > +- Tables for subsequent object formats: > + * A sorted table of shortened object names. These are prefixes of the names > + of all objects in this file, packed together without offset values to > + reduce the cache footprint of the binary search for a specific object name. > + * A table of full object names in the order specified by the first object format. Interesting, why are these sorted by the first object format again? Doesn't that mean that I have to do a linear search now to locate the entry for the second object format? Disclaimer: the following paragraphs go into how I would have designed this. This is _not_ meant as a "you have to do it this way", but as a discussion starter to figure out why you have picked the proposed format and for me to get a better understanding of it. Stepping back a bit, my expectation is that we'd have one lookup table per object format so that we can map into all directions: SHA1 -> SHA256 and in reverse. If we had more than two hash functions we'd also need to have a table for e.g. Blake3 -> SHA1 and Blake3 -> SHA256 and reverse. One way to do this is to have three tables, one for each object format. The object formats would be ordered lexicographically by their own object ID, so that one can perform a binary search for an object ID in every format. Each row could then either contain all compatibility hashes directly, but this would explode quite fast in storage space. An optimization would thus be to have one table per object format that contains the shortened object ID plus an offset where the actual record can be found. You know where to find the tables from the header, and you know the exact size of each entry, so you can trivially perform a binary search for the abbreviated object ID in that index. Once you've found that index you take the stored offset to look up the record in the "main" table. This main table contains the full object IDs for all object hashes. So something like the following simplified format: +---------------------------------+ | header | | Format version | | Number of object IDs | | SHA1: abbrev, offset | | SHA256: abbrev, offset | | Blake3: abbrev, offset | | Main: offset | +---------------------------------+ | table for SHA1 | | 11111 -> 1 | | 22222 -> 2 | +---------------------------------+ | table for SHA256 | | aaaaa -> 2 | | bbbbb -> 1 | +---------------------------------+ | table for Blake3 | | 88888 -> 2 | | 99999 -> 1 | +---------------------------------+ | main table | | 11111111 -> bbbbbbbb -> 9999999 | | 22222222 -> aaaaaaaa -> 8888888 | +---------------------------------+ | trailer | | trailer hash | +---------------------------------+ Overall you only have to store the full object ID for each hash exactly once, and the mappings also only have to be stored once. But you can look up an ID by each of its formats via its indices. With some slight adjustments one could also adapt this format to become streamable: - The header only contains the format information as well as which hash functions are contained. - The header is followed by the main table. The order of these objects is basically the streaming order, we don't care about it. We also don't have to abbreviate any hashes here. Like this we can stream the mappings to disk one by one, and we only need to remember the specific offsets where each mapping was stored. - Once all mappings have been streamed we can then write the lookup tables. We remember the starting index for each lookup table. 
- The footer contains the number of records stored in the table as well as the individual abbreviated object ID lengths per hash. From that number it becomes trivial to compute the offsets of every single lookup table. The offset of the main table is static. +---------------------------------+ | header | | Format version | | SHA1 | | SHA256 | | Blake3 | +---------------------------------+ | main table | | 11111111 -> bbbbbbbb -> 9999999 | | 22222222 -> aaaaaaaa -> 8888888 | +---------------------------------+ | table for SHA1 | | 11111 -> 1 | | 22222 -> 2 | +---------------------------------+ | table for SHA256 | | aaaaa -> 2 | | bbbbb -> 1 | +---------------------------------+ | table for Blake3 | | 88888 -> 2 | | 99999 -> 1 | +---------------------------------+ | trailer | | number of objects | | SHA1 abbrev | | SHA256 abbrev | | Blake3 abbrev | | hash | +---------------------------------+ Anyway, this is how I would have designed this format, and I think your format works differently. As I said, my intent here is not to say that you should take my format, but I mostly intend it as a discussion starter to figure out why you have chosen the proposed design so that I can get a better understanding for it. Thanks! Patrick ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 12/14] rust: add a new binary loose object map format
  2025-10-28  9:18   ` Patrick Steinhardt
@ 2025-10-29  1:37     ` brian m. carlson
  2025-10-29  9:07       ` Patrick Steinhardt
  0 siblings, 1 reply; 101+ messages in thread
From: brian m. carlson @ 2025-10-29  1:37 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Ezekiel Newren

[-- Attachment #1: Type: text/plain, Size: 11147 bytes --]

On 2025-10-28 at 09:18:32, Patrick Steinhardt wrote:
> Doesn't this indicate that calling this "loose object map" is kind of a
> misnomer? If we want to be able to store arbitrary objects regardless of
> the way those are stored (or not stored) in the ODB then I think it's
> overall quite confusing to have "loose" in the name.
>
> This isn't something we can fix for the old loose object map. But
> shouldn't we fix this now for the new format you're about to introduce?

Sure.  I will admit I'm terrible at naming things.  What do you think it
should be called?

> s/enabling//

Will fix in v2.

> As far as I understood, this legacy mapping wasn't really used anywhere
> as it is basically nonfunctional in the first place. Can we get away
> with dropping it altogether?

Sure, I can do that.

> Given that we're talking about multiple different hashes: which hash
> function is used for this checksum? I assume it's the main hash, but it
> might be sensible to document this.

It is the main hash.  I'll update that for v2.

> > +`git pack-objects` will repack existing entries into one file, removing any
> > +unnecessary objects, such as obsolete shallow entries or loose objects that
> > +have been packed.
>
> Curious that this is put into git-pack-objects(1), as it doesn't quite
> feel related to the task. Sure, it generates packfiles, but it doesn't
> really handle the logic to manage loose objects/packfiles in the repo.
> This feels closer to what git-repack(1) is doing, so would that be a
> better place to put it?

I've actually put this into `git gc`, which will come in a future
series, so I'll update this for v2.

> As far as I understand this allows us to even store multiple
> compatibility hashes if we were ever to grow a third hash. We would
> still be able to binary-search through the file as we can compute the
> size of every record with this header.

Exactly.  We were discussing BLAKE3 at the contributor summit as a
potential option.  The careful reader will note that this format looks
suspiciously like pack index v3, which is intentional.

> > + * 8-byte offset to the trailer from the beginning of this file.
> > + * Zero or more additional key/value pairs (4-byte key, 4-byte value), which
> > +   may optionally declare one or more chunks. No chunks are currently
> > +   defined. Readers must ignore unrecognized keys.
>
> How does the reader identify these key/value pairs and know how many of
> those there are? Also, do you already have an idea what those should be
> used for?

I'd imagined we could do something like fanout entries for tree
structures to help parse large trees better (since trees cannot be
binary searched).  That's something I wanted to add to multi-pack index
as a set of chunks.

They are read until the end of the header section.

> How does one figure out how many NUL bytes there's going to be? I guess
> the reader doesn't need to know as it simply uses the length of the
> header section to seek to the tables?

Exactly.  This is what we do with pack index v3 as well.  As a practical
matter, every chunk of NUL padding contains 0 to 3 bytes: just enough to
align the data for 4-byte access.
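For concreteness, the computation is just round-up-to-4 arithmetic; a
standalone sketch of what the patch's `required_nul_padding` computes:

    // Number of NUL bytes needed to pad a table of `nitems` entries of
    // `short_len` bytes each to the next 4-byte boundary.
    fn nul_padding(nitems: u64, short_len: u64) -> u64 {
        let misalignment = (nitems * short_len) & 3;
        (4 - misalignment) & 3 // 0 when already aligned, otherwise 1 to 3
    }

For example, 39 entries of 3-byte prefixes occupy 117 bytes, so 3 NUL
bytes of padding follow.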
> > +- Tables for the first object format: > > + * A sorted table of shortened object names. These are prefixes of the names > > + of all objects in this file, packed together without offset values to > > + reduce the cache footprint of the binary search for a specific object name. > > Okay. The length of the shortened object names is encoded in the header, > so all of the objects have the same length. > > Does the reader have a way to disambiguate the shortened object names? > They may be unambiguous at the point in time where the mapping is > written, but when they are being shortened it becomes plausible that the > object names becomes ambiguous at a later point in time. > > > + * A sorted table of full object names. > > Ah, I see! We have a second table further down that encodes full object > names, so yes, we can fully disambiguate. > > > + * A table of 4-byte metadata values. > > + * Zero or more chunks. A chunk starts with a four-byte chunk identifier and > > + a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte > > + size (not including the identifier, parameter, or size), plus the chunk > > + data. > > +- Zero or more NUL bytes. > > +- Tables for subsequent object formats: > > + * A sorted table of shortened object names. These are prefixes of the names > > + of all objects in this file, packed together without offset values to > > + reduce the cache footprint of the binary search for a specific object name. > > + * A table of full object names in the order specified by the first object format. > > Interesting, why are these sorted by the first object format again? > Doesn't that mean that I have to do a linear search now to locate the > entry for the second object format? No, it doesn't. The full object names are always in the order of the first format. The shortened names for second and subsequent formats point into an offset table that finds the offset in the first format. Therefore, to look up an OID in the second format knowing its OID in the first format, you use the first format's prefixes to find its offset, verify its OID in the full object names, and then look up that offset in the list of full object names in the second format. To go the other way, you find the prefix in the second format, find its corresponding offset in the mapping table, verify the full object ID in the second format, and then look up that offset in the full object names in the first format. > Disclaimer: the following paragraphs go into how I would have > designed this. This is _not_ meant as a "you have to do it this > way", but as a discussion starter to figure out why you have picked > the proposed format and for me to get a better understanding of it. The answer is that it very much resembles pack index v3, except that instead of having pack order, we just always use the sorted order of the first object format (since we don't have a pack). That also makes the data deterministic so that we always write identical files for identical objects. > Stepping back a bit, my expectation is that we'd have one lookup table > per object format so that we can map into all directions: SHA1 -> SHA256 > and in reverse. If we had more than two hash functions we'd also need to > have a table for e.g. Blake3 -> SHA1 and Blake3 -> SHA256 and reverse. Yeah, and then the file gets very large. We mmap these into memory and never free them during the life of the program (except when compacting them and deleting the unused ones), so we want to be quite conservative with our memory. 
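To make that lookup concrete, here is a minimal self-contained sketch
with invented names (the real implementation is `look_up_object` in the
patch); it maps a compatibility-format OID to its index in the
first-format tables:

    use std::cmp::Ordering;

    // Invented layout for illustration: the compat-format tables of one map file.
    struct CompatTables<'a> {
        short_len: usize,    // shortened-name length for the compat format
        shorts: &'a [u8],    // sorted shortened compat names, packed back to back
        mapping: &'a [u32],  // per shortened name: index in first-format order
        full: Vec<&'a [u8]>, // full compat names, in first-format order
    }

    // Binary search the packed prefixes, follow the order-mapping table,
    // then verify the full object ID before trusting the prefix match.
    fn compat_to_main_index(t: &CompatTables, oid: &[u8]) -> Option<usize> {
        let (mut lo, mut hi) = (0, t.full.len());
        while lo < hi {
            let mid = (lo + hi) / 2;
            let prefix = &t.shorts[mid * t.short_len..(mid + 1) * t.short_len];
            match prefix.cmp(&oid[..t.short_len]) {
                Ordering::Less => lo = mid + 1,
                Ordering::Greater => hi = mid,
                Ordering::Equal => {
                    let main_idx = t.mapping[mid] as usize;
                    return (t.full[main_idx] == oid).then_some(main_idx);
                }
            }
        }
        None
    }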
> One way to do this is to have three tables, one for each object format. > The object formats would be ordered lexicographically by their own > object ID, so that one can perform a binary search for an object ID in > every format. We have that with the shortened object IDs and we do a binary search over those. This is more cache-friendly and all we need to do is verify that the full object ID matches our value (as opposed to a different object stored elsewhere with an identical shortened prefix). > Each row could then either contain all compatibility hashes directly, > but this would explode quite fast in storage space. An optimization > would thus be to have one table per object format that contains the > shortened object ID plus an offset where the actual record can be found. > You know where to find the tables from the header, and you know the > exact size of each entry, so you can trivially perform a binary search > for the abbreviated object ID in that index. > > Once you've found that index you take the stored offset to look up the > record in the "main" table. This main table contains the full object IDs > for all object hashes. So something like the following simplified > format: > > +---------------------------------+ > | header | > | Format version | > | Number of object IDs | > | SHA1: abbrev, offset | > | SHA256: abbrev, offset | > | Blake3: abbrev, offset | > | Main: offset | > +---------------------------------+ > | table for SHA1 | > | 11111 -> 1 | > | 22222 -> 2 | > +---------------------------------+ > | table for SHA256 | > | aaaaa -> 2 | > | bbbbb -> 1 | > +---------------------------------+ > | table for Blake3 | > | 88888 -> 2 | > | 99999 -> 1 | > +---------------------------------+ > | main table | > | 11111111 -> bbbbbbbb -> 9999999 | > | 22222222 -> aaaaaaaa -> 8888888 | > +---------------------------------+ > | trailer | > | trailer hash | > +---------------------------------+ > > Overall you only have to store the full object ID for each hash exactly > once, and the mappings also only have to be stored once. But you can > look up an ID by each of its formats via its indices. This is very similar to what we have now, except that it has mapping offsets for each algorithm instead of the second and subsequent algorithms and it re-orders the location of the full object IDs. I also intentionally wanted to produce completely deterministic output, since in `git verify-pack` we verify that the output is byte-for-byte identical and I wanted to have the ability to do that here as well. (It isn't implemented yet, but that's a goal.) In order to do that, we need to write every part of the data in a fixed order, so we'd have to define the main table as being sorted by the first algorithm. > With some slight adjustments one could also adapt this format to become > streamable: I don't think these formats are as streamable as you might like. In order to create the tables, we need to sort the data for each algorithm to find the short name length, which requires knowing all of the data up front in order. I, too, thought that might be a nice idea, but when I implemented pack index v3, I realized that effectively all of the data has to be computed up front. Once you do that, computing the offsets isn't hard because it's just some addition and multiplication. I personally like a header with offsets better than a trailer since it makes parsing easier. We can peek at the first 64 bytes of the file to see if it meets our needs or has data we're interested in. -- brian m. 
carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 12/14] rust: add a new binary loose object map format 2025-10-29 1:37 ` brian m. carlson @ 2025-10-29 9:07 ` Patrick Steinhardt 0 siblings, 0 replies; 101+ messages in thread From: Patrick Steinhardt @ 2025-10-29 9:07 UTC (permalink / raw) To: brian m. carlson, git, Junio C Hamano, Ezekiel Newren On Wed, Oct 29, 2025 at 01:37:49AM +0000, brian m. carlson wrote: > On 2025-10-28 at 09:18:32, Patrick Steinhardt wrote: > > Doesn't this indicate that calling this "loose object map" is kind of a > > misnomer? If we want to be able to store arbitrary objects regardless of > > the way those are stored (or not stored) in the ODB then I think it's > > overall quite confusing to have "loose" in the name. > > > > This isn't something we can fix for the old loose object map. But > > shouldn't we fix this now for the new format you're about to introduce? > > Sure. I will admit I'm terrible at naming things. What do you think it > should be called. I think the name is quite descriptive despite the misleading "loose" part. So can't we simply drop that part and call it "object map"? [snip] > > > + * A table of 4-byte metadata values. > > > + * Zero or more chunks. A chunk starts with a four-byte chunk identifier and > > > + a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte > > > + size (not including the identifier, parameter, or size), plus the chunk > > > + data. > > > +- Zero or more NUL bytes. > > > +- Tables for subsequent object formats: > > > + * A sorted table of shortened object names. These are prefixes of the names > > > + of all objects in this file, packed together without offset values to > > > + reduce the cache footprint of the binary search for a specific object name. > > > + * A table of full object names in the order specified by the first object format. > > > > Interesting, why are these sorted by the first object format again? > > Doesn't that mean that I have to do a linear search now to locate the > > entry for the second object format? > > No, it doesn't. The full object names are always in the order of the > first format. The shortened names for second and subsequent formats > point into an offset table that finds the offset in the first format. > > Therefore, to look up an OID in the second format knowing its OID in the > first format, you use the first format's prefixes to find its offset, > verify its OID in the full object names, and then look up that offset in > the list of full object names in the second format. > > To go the other way, you find the prefix in the second format, find its > corresponding offset in the mapping table, verify the full object ID in > the second format, and then look up that offset in the full object names > in the first format. Okay. [snip] > > Overall you only have to store the full object ID for each hash exactly > > once, and the mappings also only have to be stored once. But you can > > look up an ID by each of its formats via its indices. > > This is very similar to what we have now, except that it has mapping > offsets for each algorithm instead of the second and subsequent > algorithms and it re-orders the location of the full object IDs. > > I also intentionally wanted to produce completely deterministic output, > since in `git verify-pack` we verify that the output is byte-for-byte > identical and I wanted to have the ability to do that here as well. (It > isn't implemented yet, but that's a goal.) 
In order to do that, we need
> to write every part of the data in a fixed order, so we'd have to define
> the main table as being sorted by the first algorithm.

Okay.

> > With some slight adjustments one could also adapt this format to become
> > streamable:
>
> I don't think these formats are as streamable as you might like. In
> order to create the tables, we need to sort the data for each algorithm
> to find the short name length, which requires knowing all of the data up
> front in order.
>
> I, too, thought that might be a nice idea, but when I implemented pack
> index v3, I realized that effectively all of the data has to be computed
> up front. Once you do that, computing the offsets isn't hard because
> it's just some addition and multiplication.

I guess you can make it streamable if you don't care about deterministic
output and if you're willing to have a separate ordered lookup table for
the first hash. But in any case you'd have to keep all object IDs in
memory regardless of that so that those can be sorted. I'm not sure that
this really buys us much. So overall I'm fine with it not being
streamable.

> I personally like a header with offsets better than a trailer since it
> makes parsing easier. We can peek at the first 64 bytes of the file to
> see if it meets our needs or has data we're interested in.

It's not all that bad -- we for example use this for reftables. Both for
reftables and also for your format we'd mmap anyway, and in order to
mmap you need to figure out the overall size of the file first. From
there on it shouldn't be hard to figure out where the trailer starts
based on the number of hashes and their respective sizes announced in
the header.

But I remember that this led to some head scratching for myself when I
initially dived into the reftable library, so I very much acknowledge
that it at least adds _some_ complexity.

Anyway, thanks for these explanations! One suggestion: it helped me
quite a bit to draw the ASCII diagrams I had in my previous mail. How
about we add such a diagram to help readers a bit with the high-level
structure of the format?

Patrick

^ permalink raw reply	[flat|nested] 101+ messages in thread
* Re: [PATCH 12/14] rust: add a new binary loose object map format 2025-10-27 0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt @ 2025-10-29 17:03 ` Junio C Hamano 2025-10-29 18:21 ` Junio C Hamano 2 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 17:03 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > Our current loose object format has a few problems. First, it is not > efficient: the list of object IDs is not sorted and even if it were, > there would not be an efficient way to look up objects in both > algorithms. I was confused by reading the above, mostly because "our current loose object format" meant to me the "<type> SP <length-in-decimal> NUL <payload>" deflated with zlib, which has no list of object IDs. As Patrick commented you are talking about something else? Mapping mechanism for object names between primary and compat hash algorithms? > +== Loose object mapping > + > +When the `compatObjectFormat` option is used, Git needs to store a mapping > +between the repository's main algorithm and the compatibility algorithm. There > +are two formats for this: the legacy mapping and the modern mapping. > + > +=== Legacy mapping > + > +The compatibility mapping is stored in a file called > +`$GIT_DIR/objects/loose-object-idx`. The format of this file looks like this: > + > + # loose-object-idx > + (main-name SP compat-name LF)* > + > +`main-name` refers to hexadecimal object ID of the object in the main > +repository format and `compat-name` refers to the same thing, but for the > +compatibility format. > + > +This format is read if it exists but is not written. > + > +Note that carriage returns are not permitted in this file, regardless of the > +host system or configuration. Unless it is zero cost to keep supporting the reading side, perhaps we want to drop this mapping file format? ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 12/14] rust: add a new binary loose object map format 2025-10-27 0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson 2025-10-28 9:18 ` Patrick Steinhardt 2025-10-29 17:03 ` Junio C Hamano @ 2025-10-29 18:21 ` Junio C Hamano 2 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 18:21 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > +=== Modern mapping > + > +The modern mapping consists of a set of files under `$GIT_DIR/objects/loose` > +ending in `.map`. The portion of the filename before the extension is that of > +the hash checksum in hex format. > + > +`git pack-objects` will repack existing entries into one file, removing any > +unnecessary objects, such as obsolete shallow entries or loose objects that > +have been packed. > + > +==== Mapping file format I know near the end of this document we talk about network-byte order, but let's say that upfront here. > +- A header appears at the beginning and consists of the following: > + * A 4-byte mapping signature: `LMAP` > + * 4-byte version number: 1 > + * 4-byte length of the header section. > + * 4-byte number of objects declared in this map file. > + * 4-byte number of object formats declared in this map file. > + * For each object format: > + ** 4-byte format identifier (e.g., `sha1` for SHA-1) > + ** 4-byte length in bytes of shortened object names. This is the > + shortest possible length needed to make names in the shortened > + object name table unambiguous. This number typically represents a small integer up to 32 or so, right? No objection to spend 4-byte for it, but initially I somehow was confused into thinking that this is the number of bytes for shortened object names of all the objects in this map file (i.e., (N * 6) if the map describes N objects, and 6-byte is sufficient prefix of the object names). I wonder if there is a way to rephrase the above to avoid such confusion? Also I assume that "shorten" refers to "take the first N-byte prefix". How about calling them "unique prefix of object names" or something? > + ** 8-byte integer, recording where tables relating to this format > + are stored in this index file, as an offset from the beginning. > + * 8-byte offset to the trailer from the beginning of this file. OK. > + * Zero or more additional key/value pairs (4-byte key, 4-byte value), which > + may optionally declare one or more chunks. No chunks are currently > + defined. Readers must ignore unrecognized keys. Is this misindented? In other words, shouldn't the "padding" sit immediately after "offset of the trailer in the file" and at the same level? This uses the word "chunk", which risks implying some relationship with what is described in Documentation/gitformat-chunk.adoc, but I suspect this file format has nothing to do with "Chunk-based file format" described there. "4-byte key plus 4-byte value" gives an impression that it is a dictionary to associate bunch of 4-byte words with 4-byte values, and it is hard to guess where the word "chunk" comes from. 4-byte keyword plus 4-byte offset into (a later part of) the file where the chunk defined by that keyword is stored? The length of the header part minus the size up to the 8-byte offset to the trailer defines the size occupied by "additional key/value pairs", so the reader is supposed to tell if the next 4-byte is a key that it cannot recognise or beyond the end of the header part? 
How about replacing this with

    * The remainder of the header section is reserved for future use.
      Readers must ignore this section.

until we know what kind of "chunks" are needed?

> +- Zero or more NUL bytes. These are used to improve the alignment of the
> +  4-byte quantities below.

Everything we saw so far, if the tail end of the header section that is
reserved for future use would hold zero or more <4-byte key, 4-byte
value> pairs, is of a size divisible by 4. If anything, we may be better
off saying

 * all the sections described below are placed contiguously without gap
   in the file

 * all the sections are padded with zero or more NUL bytes to make their
   length a multiple of 4

upfront, even before we start talking about the "header" section. Then
the "Zero or more NUL bytes" here, and the padding between tables, do
not have to be explicitly described.

> +- Tables for the first object format:
> +  * A sorted table of shortened object names. These are prefixes of the names
> +    of all objects in this file, packed together without offset values to
> +    reduce the cache footprint of the binary search for a specific object name.

"packed together without offset values...", while understandable, smells
a bit out of place, especially since you haven't explained what you are
trying to let readers find out from this table when they have one object
name.

Presumably, you have them take the first "length in bytes of shortened
object names" bytes from the object name they have, binary search in
this unique-prefix table for an entry that matches the prefix, to find
out that their object may appear as the N-th object in the table (but
the document hasn't told the readers that is how this table is designed
to be used yet)? And using that offset, the reader would probably ensure
that the N-th entry that appears in the next "full object names" table
does indeed fully match the object they have?

If that is the case, it is obvious that there is no "offset value"
needed here, but when the reader does not even know how this table is
supposed to be used, a sudden mention of "offset values" only confuses
them.

> +  * A sorted table of full object names.

I assume that the above two "*" bullet points are supposed to be aligned
(iow, sit at the same level under "Tables for the first object format").
In any case, our reader with a single object name would have found out
that their object appears as the N-th entry of these two tables.

> +  * A table of 4-byte metadata values.

Again, is this (and the next) "*" bullet point at the same level as the
above two tables?

The number of entries in this table is not specified. Is it one 4-byte
metadata per object described in the table (i.e. our reader recalls that
the header has a 4-byte number of objects declared in this file)? IOW,
would our reader, after finding out that the object they have is found
as the N-th entry in the previous "full object names" table, look at the
N-th entry of this metadata value table to find the metadata for their
object?

> +  * Zero or more chunks. A chunk starts with a four-byte chunk identifier and
> +    a four-byte parameter (which, if unneeded, is all zeros) and an eight-byte
> +    size (not including the identifier, parameter, or size), plus the chunk
> +    data.

When the chunk data is not a multiple of 4 bytes, don't we pad? If we
do, would the padding be included in the 8-byte size? Or if the first
chunk is of an odd size, would the second chunk be unaligned from its
identifier, parameter and size fields?
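To illustrate, a reader-side sketch of walking these chunks (this
assumes, as the next paragraph discusses, that the 8-byte size also
covers any trailing padding):

    use std::convert::TryInto;

    // Walk a sequence of chunks: 4-byte identifier, 4-byte parameter,
    // 8-byte big-endian size, then `size` bytes of chunk data. A reader
    // skips chunks whose identifier it does not recognize.
    fn walk_chunks(mut data: &[u8]) {
        while data.len() >= 16 {
            let _id = &data[0..4];
            let _param = &data[4..8];
            let size = u64::from_be_bytes(data[8..16].try_into().unwrap()) as usize;
            if data.len() < 16 + size {
                break; // truncated chunk; a real parser would error out
            }
            // A reader that recognizes `_id` would parse data[16..16 + size] here.
            data = &data[16 + size..];
        }
    }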
Presumably, you will allow older readers to safely skip chunks of newer
type they do not recognise, so a reader is expected to grab the first 16
bytes for (id, param, size), and if it does not care about the id, just
skip the size bytes to reach the next chunk, so if we were to pad (which
I think would be reasonable, given that you are padding sections to
4-byte boundaries), the eight-byte size would also count the padding at
the end of the chunk data (if the chunk data needs padding at the end,
that is).

If we make it clear that these chunks are aligned at 4-byte (or 8-byte,
I dunno) boundaries, then ...

> +- Zero or more NUL bytes.

... we do not need to have this entry whose length is unspecified (I can
guess that you added it to allow the reader to skip to the next 4-byte
boundary, but this document does not really specify it).

> +- Tables for subsequent object formats:
> +  * A sorted table of shortened object names. These are prefixes of the names
> +    of all objects in this file, packed together without offset values to
> +    reduce the cache footprint of the binary search for a specific object name.
> +  * A table of full object names in the order specified by the first object format.
> +  * A table of 4-byte values mapping object name order to the order of the
> +    first object format. For an object in the table of sorted shortened object
> +    names, the value at the corresponding index in this table is the index in
> +    the previous table for that same object.
> +  * Zero or more NUL bytes.

The same comment as the section for the primary object format. I assume
that the above four "*" bullet points are at the same level, i.e. one
unique-prefix table to let a reader with a single object name find that
their object may be the one at the N-th location in the table, followed
by the full object name table to verify that the N-th object indeed is
their object, and then find from that N that the corresponding object
name in the other hash is the M-th object in the table in the first
object format, and they go from this M to the 4-byte metadata for that
object?

> +- The trailer consists of the following:
> +  * Hash checksum of all of the above.
> +
> +The lower six bits of each metadata table contain a type field indicating the
> +reason that this object is stored:
> +
> +0::
> +	Reserved.
> +1::
> +	This object is stored as a loose object in the repository.
> +2::
> +	This object is a shallow entry. The mapping refers to a shallow value
> +	returned by a remote server.
> +3::
> +	This object is a submodule entry. The mapping refers to the commit stored
> +	representing a submodule.
> +
> +Other data may be stored in this field in the future. Bits that are not used
> +must be zero.
> +
> +All 4-byte numbers are in network order and must be 4-byte aligned in the file,
> +so the NUL padding may be required in some cases.

The document needs to be clear whether the "length" field for each
section counts this padding.

> +impl LooseObjectMemoryMap {
> +    /// Create a new `LooseObjectMemoryMap`.
> +    ///
> +    /// The storage and compatibility `HashAlgorithm` instances are used to store the object IDs in
> +    /// the correct map.
> +    fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> LooseObjectMemoryMap {
> +        LooseObjectMemoryMap {
> +            to_compat: BTreeMap::new(),
> +            to_storage: BTreeMap::new(),
> +            compat,
> +            storage,
> +        }
> +    }
> +
> +    fn len(&self) -> usize {
> +        self.to_compat.len()
> +    }
> +
> +    /// Write this map to an interface implementing `std::io::Write`.
> + fn write<W: Write>(&self, wrtr: W) -> io::Result<()> { > + const VERSION_NUMBER: u32 = 1; > + const NUM_OBJECT_FORMATS: u32 = 2; > + const PADDING: [u8; 4] = [0u8; 4]; > + > + let mut wrtr = wrtr; > + let header_size: u32 = 4 + 4 + 4 + 4 + 4 + (4 + 4 + 8) * 2 + 8; Yikes. Can this be written in a way that is easier to maintain? Certainly the earlier run of 4's corresponds to what the code below writes to wrtr, and I am wondering if we can ask wrtr how many bytes we have asked it to write so far, or something, without having the above hard-to-read numbers. > + wrtr.write_all(b"LMAP")?; > + wrtr.write_all(&VERSION_NUMBER.to_be_bytes())?; > + wrtr.write_all(&header_size.to_be_bytes())?; > + wrtr.write_all(&(self.to_compat.len() as u32).to_be_bytes())?; > + wrtr.write_all(&NUM_OBJECT_FORMATS.to_be_bytes())?; > + > + let storage_short_len = self.find_short_name_len(&self.to_compat, self.storage); > + let compat_short_len = self.find_short_name_len(&self.to_storage, self.compat); > + > + let storage_npadding = Self::required_nul_padding(self.to_compat.len(), storage_short_len); > + let compat_npadding = Self::required_nul_padding(self.to_compat.len(), compat_short_len); I said 100-column limit is OK, but I am already hating myself saying so. ^ permalink raw reply [flat|nested] 101+ messages in thread
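To make the lookup flow reconstructed in the review above concrete,
here is a minimal sketch in Rust (not code from the series; the names
and the fixed per-entry widths are assumptions) of binary searching
the shortened-name table and then confirming the hit against the
full-name table:

    fn lookup(prefixes: &[u8], full: &[u8], short_len: usize,
              full_len: usize, oid: &[u8]) -> Option<usize> {
        let key = &oid[..short_len];
        let (mut lo, mut hi) = (0usize, prefixes.len() / short_len);
        // Manual bisection: slice::binary_search cannot be used because
        // the entry width is only known at run time.
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let entry = &prefixes[mid * short_len..(mid + 1) * short_len];
            if entry < key {
                lo = mid + 1;
            } else if entry > key {
                hi = mid;
            } else {
                // Prefix hit at index N == mid; verify the N-th full name.
                let cand = &full[mid * full_len..(mid + 1) * full_len];
                return if cand == oid { Some(mid) } else { None };
            }
        }
        None
    }

As the review observes, no offset values are needed: the index found
in the prefix table is reused directly in the full-name and metadata
tables.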
* [PATCH 13/14] rust: add a small wrapper around the hashfile code
  2025-10-27  0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson
                   ` (11 preceding siblings ...)
  2025-10-27  0:44 ` [PATCH 12/14] rust: add a new binary loose object map format brian m. carlson
@ 2025-10-27  0:44 ` brian m. carlson
  2025-10-28 18:19   ` Ezekiel Newren
  2025-10-27  0:44 ` [PATCH 14/14] object-file-convert: always make sure object ID algo is valid brian m. carlson
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 101+ messages in thread
From: brian m. carlson @ 2025-10-27 0:44 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

Our new binary loose object map code avoids needing to be intimately
involved with file handling by simply writing data to an object
implementing Write. This makes it very easy to test by writing to a
Cursor wrapping a Vec, and thus decouples it from intimate knowledge
about how we handle files.

However, in actual use we want to write our data to a real file, since
that's the most practical way to persist data. Implement a wrapper
around the hashfile code that implements the Write trait so that we can
write our loose object map into a file.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile         |  1 +
 src/csum_file.rs | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
 src/lib.rs       |  1 +
 src/meson.build  |  1 +
 4 files changed, 84 insertions(+)
 create mode 100644 src/csum_file.rs

diff --git a/Makefile b/Makefile
index 2081b13780..8eb31aeed2 100644
--- a/Makefile
+++ b/Makefile
@@ -1521,6 +1521,7 @@ CLAR_TEST_OBJS += $(UNIT_TEST_DIR)/unit-test.o
 
 UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o
 
+RUST_SOURCES += src/csum_file.rs
 RUST_SOURCES += src/hash.rs
 RUST_SOURCES += src/lib.rs
 RUST_SOURCES += src/loose.rs
diff --git a/src/csum_file.rs b/src/csum_file.rs
new file mode 100644
index 0000000000..7f2c6c4fcb
--- /dev/null
+++ b/src/csum_file.rs
@@ -0,0 +1,81 @@
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation: version 2 of the License, dated June 1991.
+//
+// This program is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License along
+// with this program; if not, see <https://www.gnu.org/licenses/>.
+
+use crate::hash::{HashAlgorithm, GIT_MAX_RAWSZ};
+use std::ffi::CStr;
+use std::io::{self, Write};
+use std::os::raw::c_void;
+
+/// A writer that can write files identified by their hash or containing a trailing hash.
+pub struct HashFile {
+    ptr: *mut c_void,
+    algo: HashAlgorithm,
+}
+
+impl HashFile {
+    /// Create a new HashFile.
+    ///
+    /// The hash used will be `algo`, its name should be in `name`, and an open file descriptor
+    /// pointing to that file should be in `fd`.
+    pub fn new(algo: HashAlgorithm, fd: i32, name: &CStr) -> HashFile {
+        HashFile {
+            ptr: unsafe { c::hashfd(algo.hash_algo_ptr(), fd, name.as_ptr()) },
+            algo,
+        }
+    }
+
+    /// Finalize this HashFile instance.
+    ///
+    /// Returns the hash computed over the data.
+ pub fn finalize(self, component: u32, flags: u32) -> Vec<u8> { + let mut result = vec![0u8; GIT_MAX_RAWSZ]; + unsafe { c::finalize_hashfile(self.ptr, result.as_mut_ptr(), component, flags) }; + result.truncate(self.algo.raw_len()); + result + } +} + +impl Write for HashFile { + fn write(&mut self, data: &[u8]) -> io::Result<usize> { + for chunk in data.chunks(u32::MAX as usize) { + unsafe { + c::hashwrite( + self.ptr, + chunk.as_ptr() as *const c_void, + chunk.len() as u32, + ) + }; + } + Ok(data.len()) + } + + fn flush(&mut self) -> io::Result<()> { + unsafe { c::hashflush(self.ptr) }; + Ok(()) + } +} + +pub mod c { + use std::os::raw::{c_char, c_int, c_void}; + + extern "C" { + pub fn hashfd(algop: *const c_void, fd: i32, name: *const c_char) -> *mut c_void; + pub fn hashwrite(f: *mut c_void, data: *const c_void, len: u32); + pub fn hashflush(f: *mut c_void); + pub fn finalize_hashfile( + f: *mut c_void, + data: *mut u8, + component: u32, + flags: u32, + ) -> c_int; + } +} diff --git a/src/lib.rs b/src/lib.rs index 442f9433dc..0c598298b1 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1,3 +1,4 @@ +pub mod csum_file; pub mod hash; pub mod loose; pub mod varint; diff --git a/src/meson.build b/src/meson.build index 1eea068519..45739957b4 100644 --- a/src/meson.build +++ b/src/meson.build @@ -1,4 +1,5 @@ libgit_rs_sources = [ + 'csum_file.rs', 'hash.rs', 'lib.rs', 'loose.rs', ^ permalink raw reply related [flat|nested] 101+ messages in thread
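The decoupling described in the commit message is easy to picture with
a sketch (not code from the series; serialize_map is a hypothetical
stand-in for the real serializer): the same code can target an
in-memory buffer in unit tests and a HashFile in production, because
both implement Write:

    use std::io::{self, Cursor, Write};

    // The serializer only needs some W: Write; it never opens files itself.
    fn serialize_map<W: Write>(mut out: W) -> io::Result<()> {
        out.write_all(b"LMAP")?; // ...rest of the format elided...
        Ok(())
    }

    #[test]
    fn round_trip() -> io::Result<()> {
        // Tests capture the bytes in memory and can inspect them freely.
        let mut buf = Cursor::new(Vec::new());
        serialize_map(&mut buf)?;
        assert_eq!(&buf.get_ref()[..4], b"LMAP");
        Ok(())
    }

In production the same serialize_map would be handed a HashFile
instead, and the trailing checksum comes out of finalize().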
* Re: [PATCH 13/14] rust: add a small wrapper around the hashfile code 2025-10-27 0:44 ` [PATCH 13/14] rust: add a small wrapper around the hashfile code brian m. carlson @ 2025-10-28 18:19 ` Ezekiel Newren 2025-10-29 1:39 ` brian m. carlson 0 siblings, 1 reply; 101+ messages in thread From: Ezekiel Newren @ 2025-10-28 18:19 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Patrick Steinhardt On Sun, Oct 26, 2025 at 6:44 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > +use crate::hash::{HashAlgorithm, GIT_MAX_RAWSZ}; > +use std::ffi::CStr; > +use std::io::{self, Write}; > +use std::os::raw::c_void; std::os::raw has been deprecated, only std::ffi should be used. > +/// A writer that can write files identified by their hash or containing a trailing hash. > +pub struct HashFile { > + ptr: *mut c_void, > + algo: HashAlgorithm, > +} > + > +impl HashFile { > + /// Create a new HashFile. > + /// > + /// The hash used will be `algo`, its name should be in `name`, and an open file descriptor > + /// pointing to that file should be in `fd`. > + pub fn new(algo: HashAlgorithm, fd: i32, name: &CStr) -> HashFile { > + HashFile { > + ptr: unsafe { c::hashfd(algo.hash_algo_ptr(), fd, name.as_ptr()) }, > + algo, > + } > + } - pub fn new(algo: HashAlgorithm, fd: i32, name: &CStr) -> HashFile { - HashFile { + pub fn new(algo: HashAlgorithm, fd: i32, name: &CStr) -> Self { + Self { > + /// Finalize this HashFile instance. > + /// > + /// Returns the hash computed over the data. > + pub fn finalize(self, component: u32, flags: u32) -> Vec<u8> { > + let mut result = vec![0u8; GIT_MAX_RAWSZ]; > + unsafe { c::finalize_hashfile(self.ptr, result.as_mut_ptr(), component, flags) }; > + result.truncate(self.algo.raw_len()); > + result > + } > +} > + > +impl Write for HashFile { > + fn write(&mut self, data: &[u8]) -> io::Result<usize> { > + for chunk in data.chunks(u32::MAX as usize) { > + unsafe { > + c::hashwrite( > + self.ptr, > + chunk.as_ptr() as *const c_void, > + chunk.len() as u32, > + ) > + }; > + } > + Ok(data.len()) > + } > + > + fn flush(&mut self) -> io::Result<()> { > + unsafe { c::hashflush(self.ptr) }; > + Ok(()) > + } > +} It's always nice to implement the _Write_ trait for any type that consumes &[u8] slices. It makes it easy to use a plethora of standard library functions. ^ permalink raw reply [flat|nested] 101+ messages in thread
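To illustrate the closing point of the review above (a sketch, not
code from the series; HashFile is the type from the patch): once the
Write impl exists, the standard helpers compose with it for free:

    use std::io::{self, Write};

    fn demo(mut hf: HashFile, data: &[u8]) -> io::Result<()> {
        hf.write_all(data)?;                // loops over short writes for us
        io::copy(&mut &data[..], &mut hf)?; // bulk copy from any Read source
        hf.flush()
    }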
* Re: [PATCH 13/14] rust: add a small wrapper around the hashfile code 2025-10-28 18:19 ` Ezekiel Newren @ 2025-10-29 1:39 ` brian m. carlson 0 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-10-29 1:39 UTC (permalink / raw) To: Ezekiel Newren; +Cc: git, Junio C Hamano, Patrick Steinhardt [-- Attachment #1: Type: text/plain, Size: 617 bytes --] On 2025-10-28 at 18:19:27, Ezekiel Newren wrote: > On Sun, Oct 26, 2025 at 6:44 PM brian m. carlson > <sandals@crustytoothpaste.net> wrote: > > > +use crate::hash::{HashAlgorithm, GIT_MAX_RAWSZ}; > > +use std::ffi::CStr; > > +use std::io::{self, Write}; > > +use std::os::raw::c_void; > > std::os::raw has been deprecated, only std::ffi should be used. std::ffi with the C types is not available until Rust 1.64 and we're not planning on targeting that for some time. This was intentional, but I'll mention it in the commit message for v2. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
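For readers following along, the version constraint looks roughly like
this (a sketch; the stabilization versions are from the Rust release
notes, not from this thread):

    // Fine on old toolchains: c_void has been in std::ffi since Rust 1.30.
    use std::ffi::c_void;
    // Only valid from Rust 1.64 on, when the C integer aliases were added
    // to std::ffi; with an older MSRV they must come from std::os::raw.
    use std::os::raw::{c_char, c_int};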
* [PATCH 14/14] object-file-convert: always make sure object ID algo is valid
  2025-10-27  0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson
                   ` (12 preceding siblings ...)
  2025-10-27  0:44 ` [PATCH 13/14] rust: add a small wrapper around the hashfile code brian m. carlson
@ 2025-10-27  0:44 ` brian m. carlson
  2025-10-29 20:07 ` [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 Junio C Hamano
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 101+ messages in thread
From: brian m. carlson @ 2025-10-27 0:44 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

In some cases, we zero-initialize our object IDs, which sets the algo
member to zero as well, which is not a valid algorithm number. This is
a bad practice, but we typically paper over it by simply substituting
the repository's hash algorithm.

However, our new Rust loose object map code doesn't handle this
gracefully and can't find object IDs when the algorithm is zero
because they don't compare equal to those with the correct algo field.
In addition, the comparison code doesn't have any knowledge of what
the main algorithm is because that's global state, so we can't adjust
the comparison.

To make our code function properly and to avoid propagating these bad
entries, if we get a source object ID with a zero algo, just make a
copy of it with the corrected algorithm. This also has the benefit of
fixing the object IDs when we're in single algorithm mode.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 object-file-convert.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/object-file-convert.c b/object-file-convert.c
index e44c821084..f8dce94811 100644
--- a/object-file-convert.c
+++ b/object-file-convert.c
@@ -13,7 +13,7 @@
 #include "gpg-interface.h"
 #include "object-file-convert.h"
 
-int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
+int repo_oid_to_algop(struct repository *repo, const struct object_id *srcoid,
                       const struct git_hash_algo *to, struct object_id *dest)
 {
         /*
@@ -21,7 +21,15 @@ int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
          * default hash algorithm for that object.
          */
         const struct git_hash_algo *from =
-                src->algo ? &hash_algos[src->algo] : repo->hash_algo;
+                srcoid->algo ? &hash_algos[srcoid->algo] : repo->hash_algo;
+        struct object_id temp;
+        const struct object_id *src = srcoid;
+
+        if (!srcoid->algo) {
+                oidcpy(&temp, srcoid);
+                temp.algo = hash_algo_by_ptr(repo->hash_algo);
+                src = &temp;
+        }
 
         if (from == to || !to) {
                 if (src != dest)

^ permalink raw reply related [flat|nested] 101+ messages in thread
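To see why the Rust side cannot paper over a zero algo, consider this
sketch (ObjectID is the Rust type from earlier in the series; Clone,
Ord, and a public algo field are assumed here for illustration only):

    use std::collections::BTreeMap;

    fn demo(map: &BTreeMap<ObjectID, ObjectID>, oid: &ObjectID) {
        let mut zeroed = oid.clone();
        zeroed.algo = 0; // what a zero-initialized struct object_id carries
        // The algorithm tag participates in equality and ordering, so the
        // two keys differ even though the digest bytes are identical, and
        // the lookup comes back empty.
        assert!(map.get(&zeroed).is_none());
    }

Fixing up the algo on the C side before the lookup, as this patch does,
sidesteps the problem without teaching the Rust code about global state.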
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (13 preceding siblings ...) 2025-10-27 0:44 ` [PATCH 14/14] object-file-convert: always make sure object ID algo is valid brian m. carlson @ 2025-10-29 20:07 ` Junio C Hamano 2025-10-29 20:15 ` Junio C Hamano 2025-11-11 0:12 ` Ezekiel Newren ` (2 subsequent siblings) 17 siblings, 1 reply; 101+ messages in thread From: Junio C Hamano @ 2025-10-29 20:07 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > This is the second part of the SHA-1/SHA-256 interoperability work. It > introduces our first major use of Rust code to implement a loose object > format as well as preparatory work to make that happen, including > changing types to more Rust-friendly ones. Since Rust will be required > for the interoperability work, we require that in the testsuite. So, "make WITH_RUST=YesPlease" on 'seen' seems to barf like so (line wrapping added by me): ... AR libgit.a CARGO target/release/libgitcore.a error: the `cargo::` syntax for build script output instructions was added \ in Rust 1.77.0, but the minimum supported Rust version of `gitcore \ v0.1.0 (/home/jch/w/git.build)` is 1.49.0. Switch to the old `cargo:rustc-link-search=.` syntax (note the single colon). See https://doc.rust-lang.org/cargo/reference/build-scripts.html#outputs-of\ -the-build-script for more information about build script outputs. gmake: *** [Makefile:2964: target/release/libgitcore.a] Error 101 We either need to downdate the syntax or do the following, but IIRC, 1.77 is a bit too new for the Debian oldstable? Cargo.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git c/Cargo.toml w/Cargo.toml index 2f51bf5d5f..7772321dd7 100644 --- c/Cargo.toml +++ w/Cargo.toml @@ -2,7 +2,7 @@ name = "gitcore" version = "0.1.0" edition = "2018" -rust-version = "1.49.0" +rust-version = "1.77.0" [lib] crate-type = ["staticlib"] ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-10-29 20:07 ` [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 Junio C Hamano
@ 2025-10-29 20:15   ` Junio C Hamano
  0 siblings, 0 replies; 101+ messages in thread
From: Junio C Hamano @ 2025-10-29 20:15 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren

Junio C Hamano <gitster@pobox.com> writes:

> We either need to downdate the syntax or do the following, but IIRC,
> 1.77 is a bit too new for the Debian oldstable?
>
>  Cargo.toml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git c/Cargo.toml w/Cargo.toml
> index 2f51bf5d5f..7772321dd7 100644
> --- c/Cargo.toml
> +++ w/Cargo.toml
> @@ -2,7 +2,7 @@
>  name = "gitcore"
>  version = "0.1.0"
>  edition = "2018"
> -rust-version = "1.49.0"
> +rust-version = "1.77.0"
>
>  [lib]
>  crate-type = ["staticlib"]

For now, I'd add this on top of the topic and rebuild 'seen'.

--- >8 ---
Subject: [PATCH] SQUASH??? downgrade build.rs syntax

As the build with "make WITH_RUST=YesPlease" dies like so

    ...
    AR libgit.a
    CARGO target/release/libgitcore.a
    error: the `cargo::` syntax for build script output instructions was added in \
      Rust 1.77.0, but the minimum supported Rust version of `gitcore v0.1.0 \
      (/home/gitster/w/git.build)` is 1.49.0.
    Switch to the old `cargo:rustc-link-search=.` syntax (note the single colon).
    See https://doc.rust-lang.org/cargo/reference/build-scripts.html#outputs-of-\
    the-build-script for more information about build script outputs.
    gmake: *** [Makefile:2964: target/release/libgitcore.a] Error 101

work it around by downgrading the syntax as the error message
suggests.
---
 build.rs | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/build.rs b/build.rs
index 136d58c35a..3228367b5d 100644
--- a/build.rs
+++ b/build.rs
@@ -11,11 +11,11 @@
 // with this program; if not, see <https://www.gnu.org/licenses/>.
 
 fn main() {
-    println!("cargo::rustc-link-search=.");
-    println!("cargo::rustc-link-search=reftable");
-    println!("cargo::rustc-link-search=xdiff");
-    println!("cargo::rustc-link-lib=git");
-    println!("cargo::rustc-link-lib=reftable");
-    println!("cargo::rustc-link-lib=z");
-    println!("cargo::rustc-link-lib=xdiff");
+    println!("cargo:rustc-link-search=.");
+    println!("cargo:rustc-link-search=reftable");
+    println!("cargo:rustc-link-search=xdiff");
+    println!("cargo:rustc-link-lib=git");
+    println!("cargo:rustc-link-lib=reftable");
+    println!("cargo:rustc-link-lib=z");
+    println!("cargo:rustc-link-lib=xdiff");
 }
-- 
2.51.2-698-g3eff15350e

^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (14 preceding siblings ...) 2025-10-29 20:07 ` [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 Junio C Hamano @ 2025-11-11 0:12 ` Ezekiel Newren 2025-11-14 17:25 ` Junio C Hamano 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson 17 siblings, 0 replies; 101+ messages in thread From: Ezekiel Newren @ 2025-11-11 0:12 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Patrick Steinhardt On Sun, Oct 26, 2025 at 6:44 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > > This is the second part of the SHA-1/SHA-256 interoperability work. It > introduces our first major use of Rust code to implement a loose object > format as well as preparatory work to make that happen, including > changing types to more Rust-friendly ones. Since Rust will be required > ... I'm working on a patch series that converts the Cargo crate into a Cargo workspace. This means that /src will be moved to /gitcore/src. I plan on releasing that patch series after v2.52.0 is released. Using a Cargo workspace over a single crate is discussed partially in [1]. Patrick has decided to let me introduce cbindgen and the Cargo workspace conversion [2]. [1] Patrick's patch series on cbindgen https://lore.kernel.org/git/20251023-b4-pks-rust-cbindgen-v1-0-c19b61b03127@pks.im/ [2] Patrick discarding his patch series https://lore.kernel.org/git/aQ3XOTX0AT_eFc5P@pks.im/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-10-27 0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson ` (15 preceding siblings ...) 2025-11-11 0:12 ` Ezekiel Newren @ 2025-11-14 17:25 ` Junio C Hamano 2025-11-14 21:11 ` Junio C Hamano 2025-11-17 6:56 ` Junio C Hamano 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson 17 siblings, 2 replies; 101+ messages in thread From: Junio C Hamano @ 2025-11-14 17:25 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > The new Rust files have adopted an approach that is slightly different > from some of our other files and placed a license notice at the top. > This is required because of DCO part (a): "I have the right to submit it > under the open source license indicated in the file". It also avoids > ambiguity if the file is copied into a separate location (such as an LLM > training corpus). You may be aware of them already, but just in case, I was looking at CI breakages and noticed that "cargo clippy" warnings added in 4b44c464 (ci: check for common Rust mistakes via Clippy, 2025-10-15) https://github.com/git/git/actions/runs/19346329259/job/55347554528#step:5:73 mostly seem to come from steps 12 and 13 of this series. Thanks. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-14 17:25 ` Junio C Hamano
@ 2025-11-14 21:11   ` Junio C Hamano
  0 siblings, 0 replies; 101+ messages in thread
From: Junio C Hamano @ 2025-11-14 21:11 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren

Junio C Hamano <gitster@pobox.com> writes:

> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
>> The new Rust files have adopted an approach that is slightly different
>> from some of our other files and placed a license notice at the top.
>> This is required because of DCO part (a): "I have the right to submit it
>> under the open source license indicated in the file". It also avoids
>> ambiguity if the file is copied into a separate location (such as an LLM
>> training corpus).
>
> You may be aware of them already, but just in case, I was looking at
> CI breakages and noticed that "cargo clippy" warnings added in
> 4b44c464 (ci: check for common Rust mistakes via Clippy, 2025-10-15)
>
> https://github.com/git/git/actions/runs/19346329259/job/55347554528#step:5:73
>
> mostly seem to come from steps 12 and 13 of this series.
>
> Thanks.

This is what I queued on top for today's integration run in an attempt
to work it around.

I am happy about the changes to assert_eq!(*, [true|false]), even
though I may not be happy that clippy is unhappy about this particular
construct. I also am not so unhappy with the "do not needlessly
borrow" changes near the end.

The first hunk in src/loose.rs is a monkey-see-monkey-do patch that
may or may not make any sense; I strongly want it replaced with a
proper update by somebody who knows what they are doing.

 src/hash.rs  |  2 +-
 src/loose.rs | 15 ++++++++-------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/src/hash.rs b/src/hash.rs
index 8798a50aef..cc696688af 100644
--- a/src/hash.rs
+++ b/src/hash.rs
@@ -310,7 +310,7 @@ mod tests {
         ];
         for (data, oid) in tests {
             let mut h = algo.hasher();
-            assert_eq!(h.is_safe(), true);
+            assert!(h.is_safe());
             // Test that this works incrementally.
             h.update(&data[0..2]);
             h.update(&data[2..]);
diff --git a/src/loose.rs b/src/loose.rs
index a4e7d2fa48..8d4264c626 100644
--- a/src/loose.rs
+++ b/src/loose.rs
@@ -700,7 +700,8 @@ mod tests {
         }
     }
 
-    fn test_entries() -> &'static [(&'static str, &'static [u8], &'static [u8], MapType, bool)] {
+    type TwoHashesTestVectorEntry = (&'static str, &'static [u8], &'static [u8], MapType, bool);
+    fn test_entries() -> &'static [TwoHashesTestVectorEntry] {
         // These are all example blobs containing the content in the first argument.
&[ ("abc", b"\xf2\xba\x8f\x84\xab\x5c\x1b\xce\x84\xa7\xb4\x41\xcb\x19\x59\xcf\xc7\x09\x3b\x7f", b"\xc1\xcf\x6e\x46\x50\x77\x93\x0e\x88\xdc\x51\x36\x64\x1d\x40\x2f\x72\xa2\x29\xdd\xd9\x96\xf6\x27\xd6\x0e\x96\x39\xea\xba\x35\xa6", MapType::LooseObject, false), @@ -741,7 +742,7 @@ mod tests { let mut wrtr = TrailingWriter::new(); map.finish_batch(&mut wrtr).unwrap(); - assert_eq!(map.has_batch(), false); + assert!(!map.has_batch()); let data = wrtr.finalize(); MmapedLooseObjectMap::new(&data, HashAlgorithm::SHA256).unwrap(); @@ -754,7 +755,7 @@ mod tests { let mut wrtr = TrailingWriter::new(); map.finish_batch(&mut wrtr).unwrap(); - assert_eq!(map.has_batch(), false); + assert!(!map.has_batch()); let data = wrtr.finalize(); let entries = test_entries(); @@ -886,16 +887,16 @@ mod tests { let s256 = f(HashAlgorithm::SHA256); let s1 = f(HashAlgorithm::SHA1); - let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap(); + let res = map.map_object(s256, HashAlgorithm::SHA1).unwrap(); assert_eq!(res.oid, *s1); assert_eq!(res.kind, MapType::Reserved); - let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap(); + let res = map.map_oid(s256, HashAlgorithm::SHA1).unwrap(); assert_eq!(*res, *s1); - let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap(); + let res = map.map_object(s1, HashAlgorithm::SHA256).unwrap(); assert_eq!(res.oid, *s256); assert_eq!(res.kind, MapType::Reserved); - let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap(); + let res = map.map_oid(s1, HashAlgorithm::SHA256).unwrap(); assert_eq!(*res, *s256); } } -- 2.52.0-rc2-455-g230fcf2819 ^ permalink raw reply related [flat|nested] 101+ messages in thread
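For anyone puzzling over the two kinds of changes above, this is
roughly what clippy is reacting to (a sketch, not code from the
series; the type names come from the quoted diffs):

    // clippy::needless_borrow: `f` already hands back a reference, so
    // `&s256` would be a `&&ObjectID` the compiler has to auto-deref.
    fn demo(map: &MmapedLooseObjectMap,
            f: impl Fn(HashAlgorithm) -> &'static ObjectID) {
        let s256 = f(HashAlgorithm::SHA256);
        let _ = map.map_object(s256, HashAlgorithm::SHA1);
    }

    // clippy::type_complexity: name the five-element tuple once instead
    // of spelling it out inline in a function signature.
    type TwoHashesTestVectorEntry =
        (&'static str, &'static [u8], &'static [u8], MapType, bool);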
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-14 17:25 ` Junio C Hamano
  2025-11-14 21:11   ` Junio C Hamano
@ 2025-11-17  6:56   ` Junio C Hamano
  2025-11-17 22:09     ` brian m. carlson
  1 sibling, 1 reply; 101+ messages in thread
From: Junio C Hamano @ 2025-11-17 6:56 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren

Junio C Hamano <gitster@pobox.com> writes:

> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
>> The new Rust files have adopted an approach that is slightly different
>> from some of our other files and placed a license notice at the top.
>> This is required because of DCO part (a): "I have the right to submit it
>> under the open source license indicated in the file". It also avoids
>> ambiguity if the file is copied into a separate location (such as an LLM
>> training corpus).
>
> You may be aware of them already, but just in case, I was looking at
> CI breakages ...

In addition to the "cargo clippy" warnings I reported earlier (and
attempted to fix) in a separate message, we have been seeing constant
failure of the "win+Meson build" job at GitHub Actions CI.

https://github.com/git/git/actions/runs/19414557042/job/55540901761#step:6:848

I attempted to build tonight's 'seen' without this topic and the
failure seemed to stop.

https://github.com/git/git/actions/runs/19418361570/job/55551045554

This topic may need a bit of help from those who are clueful with
Rust and Windows.

Thanks.

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-11-17 6:56 ` Junio C Hamano @ 2025-11-17 22:09 ` brian m. carlson 2025-11-18 0:13 ` Junio C Hamano 0 siblings, 1 reply; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:09 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 939 bytes --] On 2025-11-17 at 06:56:07, Junio C Hamano wrote: > In addition to "cargo clippy" I reported earlier (and attempted to > fix) in a separate message, we have been seeing constant failure of > "win+Meson build" job at GitHub Actions CI. > > https://github.com/git/git/actions/runs/19414557042/job/55540901761#step:6:848 > > I attempted to build tonight's 'seen' without this topic and it > seemed to stop. > > https://github.com/git/git/actions/runs/19418361570/job/55551045554 > > This topic may need a bit of help from those who are clueful with > Rust and Windows. I think that has been failing with Rust since well before my code came in. It has failed for me for a long time (well over a month), so I have just ignored it. I'm going to send v2 shortly, but we can squash in changes and do a v3 if there is something actually broken in this series. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-17 22:09 ` brian m. carlson
@ 2025-11-18  0:13   ` Junio C Hamano
  2025-11-19 23:04     ` brian m. carlson
  0 siblings, 1 reply; 101+ messages in thread
From: Junio C Hamano @ 2025-11-18 0:13 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On 2025-11-17 at 06:56:07, Junio C Hamano wrote:
>> In addition to the "cargo clippy" warnings I reported earlier (and
>> attempted to fix) in a separate message, we have been seeing constant
>> failure of the "win+Meson build" job at GitHub Actions CI.
>>
>> https://github.com/git/git/actions/runs/19414557042/job/55540901761#step:6:848
>>
>> I attempted to build tonight's 'seen' without this topic and the
>> failure seemed to stop.
>>
>> https://github.com/git/git/actions/runs/19418361570/job/55551045554
>>
>> This topic may need a bit of help from those who are clueful with
>> Rust and Windows.
>
> I think that has been failing with Rust since well before my code came
> in. It has failed for me for a long time (well over a month), so I have
> just ignored it.
>
> I'm going to send v2 shortly, but we can squash in changes and do a v3
> if there is something actually broken in this series.

Thanks.

    $ git log --oneline --first-parent -4 seen
    3f252ac9fe Merge branch 'ar/run-command-hook' into seen
    672cb7c62e ### CI
    3af201233b Merge branch 'bc/sha1-256-interop-02' into seen
    950efaac03 Merge branch 'cc/fast-import-strip-if-invalid' into seen

It seems that 672cb7c62e (which is an empty commit on top of the
merge of v2 of this series) fails win+Meson

https://github.com/git/git/actions/runs/19447841443/job/55646336507#step:6:689

but 950efaac03 (which is the merge before v2 of this series is
merged to 'seen') is happy with it.

https://github.com/git/git/actions/runs/19448271167/job/55647611566

These two runs roughly correspond to the with=bad/without=good pair
in the message you are responding to, but with the v1 of this series.

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-11-18 0:13 ` Junio C Hamano @ 2025-11-19 23:04 ` brian m. carlson 2025-11-19 23:24 ` Junio C Hamano 2025-11-19 23:37 ` Ezekiel Newren 0 siblings, 2 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-19 23:04 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Ezekiel Newren [-- Attachment #1: Type: text/plain, Size: 1222 bytes --] On 2025-11-18 at 00:13:40, Junio C Hamano wrote: > Thanks. > > $ git log --oneline --first-parent -4 seen > 3f252ac9fe Merge branch 'ar/run-command-hook' into seen > 672cb7c62e ### CI > 3af201233b Merge branch 'bc/sha1-256-interop-02' into seen > 950efaac03 Merge branch 'cc/fast-import-strip-if-invalid' into seen > > It seems that 672cb7c62e (which is an empty commit on top of the > merge of v2 of this series) fails win+Meson > > https://github.com/git/git/actions/runs/19447841443/job/55646336507#step:6:689 > > but 950efaac03 (which is the merge before v2 of this series is > merged to 'seen') is happy with it. > > https://github.com/git/git/actions/runs/19448271167/job/55647611566 > > These two runs roughly corresponds to the with=bad/without=good pair > in the message you are reponding to, but with the v1 of this series. Yes, I think we'll need someone familiar with Windows to take a look at that. The message doesn't indicate anything obvious and I don't have any Windows systems available to investigate. My guess is that it's something to do with the build.rs file, but I'm not certain. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-11-19 23:04 ` brian m. carlson @ 2025-11-19 23:24 ` Junio C Hamano 2025-11-19 23:37 ` Ezekiel Newren 1 sibling, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-11-19 23:24 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Patrick Steinhardt, Ezekiel Newren "brian m. carlson" <sandals@crustytoothpaste.net> writes: > On 2025-11-18 at 00:13:40, Junio C Hamano wrote: >> Thanks. >> >> $ git log --oneline --first-parent -4 seen >> 3f252ac9fe Merge branch 'ar/run-command-hook' into seen >> 672cb7c62e ### CI >> 3af201233b Merge branch 'bc/sha1-256-interop-02' into seen >> 950efaac03 Merge branch 'cc/fast-import-strip-if-invalid' into seen >> >> It seems that 672cb7c62e (which is an empty commit on top of the >> merge of v2 of this series) fails win+Meson >> >> https://github.com/git/git/actions/runs/19447841443/job/55646336507#step:6:689 >> >> but 950efaac03 (which is the merge before v2 of this series is >> merged to 'seen') is happy with it. >> >> https://github.com/git/git/actions/runs/19448271167/job/55647611566 >> >> These two runs roughly corresponds to the with=bad/without=good pair >> in the message you are reponding to, but with the v1 of this series. > > Yes, I think we'll need someone familiar with Windows to take a look at > that. The message doesn't indicate anything obvious and I don't have > any Windows systems available to investigate. > > My guess is that it's something to do with the build.rs file, but I'm > not certain. Today's pushout includes jk/ci-windows-meson-test-fix that restores the ability to show the failure log from win+Meson jobs, so we will hopefully see something a bit more usable than what we saw in the previous runs. Thanks. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-19 23:04     ` brian m. carlson
  2025-11-19 23:24       ` Junio C Hamano
@ 2025-11-19 23:37       ` Ezekiel Newren
  2025-11-20 19:52         ` Ezekiel Newren
  1 sibling, 1 reply; 101+ messages in thread
From: Ezekiel Newren @ 2025-11-19 23:37 UTC (permalink / raw)
  To: brian m. carlson, Junio C Hamano, git, Patrick Steinhardt, Ezekiel Newren

On Wed, Nov 19, 2025 at 4:04 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2025-11-18 at 00:13:40, Junio C Hamano wrote:
> > Thanks.
> >
> >     $ git log --oneline --first-parent -4 seen
> >     3f252ac9fe Merge branch 'ar/run-command-hook' into seen
> >     672cb7c62e ### CI
> >     3af201233b Merge branch 'bc/sha1-256-interop-02' into seen
> >     950efaac03 Merge branch 'cc/fast-import-strip-if-invalid' into seen
> >
> > It seems that 672cb7c62e (which is an empty commit on top of the
> > merge of v2 of this series) fails win+Meson
> >
> > https://github.com/git/git/actions/runs/19447841443/job/55646336507#step:6:689
> >
> > but 950efaac03 (which is the merge before v2 of this series is
> > merged to 'seen') is happy with it.
> >
> > https://github.com/git/git/actions/runs/19448271167/job/55647611566
> >
> > These two runs roughly correspond to the with=bad/without=good pair
> > in the message you are responding to, but with the v1 of this series.
>
> Yes, I think we'll need someone familiar with Windows to take a look at
> that. The message doesn't indicate anything obvious and I don't have
> any Windows systems available to investigate.
>
> My guess is that it's something to do with the build.rs file, but I'm
> not certain.

This was a known issue that I pointed out before Patrick's
"Introduce Rust" series was merged [1].

[1] https://lore.kernel.org/git/CAH=ZcbBjL09Mk3AXBSgmZGvmFtU3Roc2P5rbQsZ-U5DBHYSs7w@mail.gmail.com/

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-19 23:37       ` Ezekiel Newren
@ 2025-11-20 19:52         ` Ezekiel Newren
  2025-11-20 23:02           ` brian m. carlson
  0 siblings, 1 reply; 101+ messages in thread
From: Ezekiel Newren @ 2025-11-20 19:52 UTC (permalink / raw)
  To: brian m. carlson, Junio C Hamano, git, Patrick Steinhardt, Ezekiel Newren

On Wed, Nov 19, 2025 at 4:37 PM Ezekiel Newren <ezekielnewren@gmail.com> wrote:
>
> On Wed, Nov 19, 2025 at 4:04 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> >
> > On 2025-11-18 at 00:13:40, Junio C Hamano wrote:
> > > Thanks.
> > >
> > >     $ git log --oneline --first-parent -4 seen
> > >     3f252ac9fe Merge branch 'ar/run-command-hook' into seen
> > >     672cb7c62e ### CI
> > >     3af201233b Merge branch 'bc/sha1-256-interop-02' into seen
> > >     950efaac03 Merge branch 'cc/fast-import-strip-if-invalid' into seen
> > >
> > > It seems that 672cb7c62e (which is an empty commit on top of the
> > > merge of v2 of this series) fails win+Meson
> > >
> > > https://github.com/git/git/actions/runs/19447841443/job/55646336507#step:6:689
> > >
> > > but 950efaac03 (which is the merge before v2 of this series is
> > > merged to 'seen') is happy with it.
> > >
> > > https://github.com/git/git/actions/runs/19448271167/job/55647611566
> > >
> > > These two runs roughly correspond to the with=bad/without=good pair
> > > in the message you are responding to, but with the v1 of this series.
> >
> > Yes, I think we'll need someone familiar with Windows to take a look at
> > that. The message doesn't indicate anything obvious and I don't have
> > any Windows systems available to investigate.
> >
> > My guess is that it's something to do with the build.rs file, but I'm
> > not certain.
>
> This was a known issue that I pointed out before Patrick's
> "Introduce Rust" series was merged [1].
>
> [1] https://lore.kernel.org/git/CAH=ZcbBjL09Mk3AXBSgmZGvmFtU3Roc2P5rbQsZ-U5DBHYSs7w@mail.gmail.com/

Check out my retrospective review [1]. Basically, if windows + msvc ->
<crate>.lib, else lib<crate>.a, but it was coded as just
if windows -> ...

In the GitHub CI these are the only Windows combos that are tested.

    "win build"       is windows + gnu  + Makefile
    "win+Meson build" is windows + msvc + Meson

[1] ci windows problems
https://lore.kernel.org/git/CAH=ZcbB8cRgCTp-Q_CxJ4VFNY1+w+C20zgx9bMre4-hNmPrD7g@mail.gmail.com/

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-20 19:52         ` Ezekiel Newren
@ 2025-11-20 23:02           ` brian m. carlson
  2025-11-20 23:11             ` Ezekiel Newren
  0 siblings, 1 reply; 101+ messages in thread
From: brian m. carlson @ 2025-11-20 23:02 UTC (permalink / raw)
  To: Ezekiel Newren; +Cc: Junio C Hamano, git, Patrick Steinhardt

[-- Attachment #1: Type: text/plain, Size: 1636 bytes --]

On 2025-11-20 at 19:52:23, Ezekiel Newren wrote:
> Check out my retrospective review [1]. Basically, if windows + msvc ->
> <crate>.lib, else lib<crate>.a, but it was coded as just
> if windows -> ...
>
> In the GitHub CI these are the only Windows combos that are tested.
>
>     "win build"       is windows + gnu  + Makefile
>     "win+Meson build" is windows + msvc + Meson

So I don't think that fixes the build[0]; this is the patch I tried:

-- %< --
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: "brian m. carlson" <sandals@crustytoothpaste.net>
Date: Thu, 20 Nov 2025 22:52:37 +0000
Subject: [PATCH] WIP: try fixing CI

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile           | 2 +-
 src/cargo-meson.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index b05709c5e9..8bdb05e535 100644
--- a/Makefile
+++ b/Makefile
@@ -934,7 +934,7 @@ else
 RUST_TARGET_DIR = target/release
 endif
 
-ifeq ($(uname_S),Windows)
+ifdef MSVC
 RUST_LIB = $(RUST_TARGET_DIR)/gitcore.lib
 else
 RUST_LIB = $(RUST_TARGET_DIR)/libgitcore.a
diff --git a/src/cargo-meson.sh b/src/cargo-meson.sh
index 3998db0435..80c10b22cf 100755
--- a/src/cargo-meson.sh
+++ b/src/cargo-meson.sh
@@ -27,7 +27,7 @@ then
 fi
 
 case "$(cargo -vV | sed -s 's/^host: \(.*\)$/\1/')" in
-	*-windows-*)
+	*-windows-msvc*)
 		LIBNAME=gitcore.lib;;
 	*)
 		LIBNAME=libgitcore.a;;
-- 
2.51.0.338.gd7d06c2dae8
-- %< --

[0] https://github.com/bk2204/git/actions/runs/19553883891/job/55991786359
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2
  2025-11-20 23:02           ` brian m. carlson
@ 2025-11-20 23:11             ` Ezekiel Newren
  2025-11-20 23:14               ` Junio C Hamano
  0 siblings, 1 reply; 101+ messages in thread
From: Ezekiel Newren @ 2025-11-20 23:11 UTC (permalink / raw)
  To: brian m. carlson, Ezekiel Newren, Junio C Hamano, git, Patrick Steinhardt

On Thu, Nov 20, 2025 at 4:03 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2025-11-20 at 19:52:23, Ezekiel Newren wrote:
> > Check out my retrospective review [1]. Basically, if windows + msvc ->
> > <crate>.lib, else lib<crate>.a, but it was coded as just
> > if windows -> ...
> >
> > In the GitHub CI these are the only Windows combos that are tested.
> >
> >     "win build"       is windows + gnu  + Makefile
> >     "win+Meson build" is windows + msvc + Meson
>
> So I don't think that fixes the build[0]; this is the patch I tried:

You are correct. It's only part of the whole solution. I'm working on
ironing out all of the GitHub CI problems in my cargo-workspace patch
series (not yet released). Once I've figured that out, I'll publish my
series on the mailing list.

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 2025-11-20 23:11 ` Ezekiel Newren @ 2025-11-20 23:14 ` Junio C Hamano 0 siblings, 0 replies; 101+ messages in thread From: Junio C Hamano @ 2025-11-20 23:14 UTC (permalink / raw) To: Ezekiel Newren; +Cc: brian m. carlson, git, Patrick Steinhardt Ezekiel Newren <ezekielnewren@gmail.com> writes: > On Thu, Nov 20, 2025 at 4:03 PM brian m. carlson > <sandals@crustytoothpaste.net> wrote: >> >> On 2025-11-20 at 19:52:23, Ezekiel Newren wrote: >> > Checkout my retrospective review [1]. Basically if windows + msvc -> >> > <crate>.lib else lib<crate>.a, but it was coded as just if windows -> >> > ... >> > >> > In the github ci these are the only windows combos that are tested. >> > "win build" is windows + gnu + Makefile >> > "win+Meson build" windows + msvc + Meson >> >> So I don't think that fixes the build[0] with this patch: > > You are correct. It's part of the whole solution. I'm working on > ironing out all github ci problems in my cargo-workspace patch series > (not yet released). Once I've figured that out I'll publish my series > on the mailing list. Thanks for working well together. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH v2 00/15] SHA-1/SHA-256 interoperability, part 2
  2025-10-27  0:43 [PATCH 00/14] SHA-1/SHA-256 interoperability, part 2 brian m. carlson
                   ` (16 preceding siblings ...)
  2025-11-14 17:25 ` Junio C Hamano
@ 2025-11-17 22:16 ` brian m. carlson
  2025-11-17 22:16   ` [PATCH v2 01/15] repository: require Rust support for interoperability brian m. carlson
                   ` (14 more replies)
  17 siblings, 15 replies; 101+ messages in thread
From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

This is the second part of the SHA-1/SHA-256 interoperability work. It
introduces our first major use of Rust code to implement an object map
format as well as preparatory work to make that happen, including
changing types to more Rust-friendly ones. Since Rust will be required
for the interoperability work, we require that in the testsuite.

We also verify that our object ID algorithm is valid when looking up
data in the hash map since the Rust code intentionally has no knowledge
about global mutable state like the_repository and so cannot default to
the main hash algorithm when we've zero-initialized a struct object_id.

The advantage of this Rust code is that it is comprehensively tested
with unit tests. We can serialize our object map and then verify that
we can also load it again and perform various tests, such as whether
certain object IDs are found in the map and mapped correctly. We can
also test our slightly subtle custom binary search code effectively and
be confident that it works, since Rust doesn't provide a way to binary
search slices of variable length.

I have opted not to use an enum type for our hash algorithm and have
preserved the use of uint32_t from v1. A C enum type would not map
one-to-one with the Rust type (since the C version would use
GIT_HASH_UNKNOWN for unknown values and Rust would use None instead),
so to avoid problems as we generate more of the integration code with
bindgen and cbindgen, I've chosen to leave it as it is.

Changes since v1:
* Use `MAYBE_UNUSED` instead of casting.
* Explain reason for `ObjectID` structure.
* Switch to `Result` in hash algorithm abstraction.
* Add some additional helpers to `ObjectID`.
* Rename function to `hash_algo_ptr_by_number`.
* Switch to `xmalloc`.
* Fix `build.rs` to use syntax compatible with Rust 1.63.
* Remove unneeded libraries from `build.rs`.
* Improve Rust documentation.
* Explain that safe hashing is about untrusted data, not memory safety.
* Add a trait for hashing to allow for future unsafe (trusted data)
  hashing.
* Rename `Hasher` to `CryptoHasher`.
* Remove description of legacy loose object map.
* Rename loose object map to object map.
* Update documentation for object map to be clearer about padding,
  alignment, and endianness.
* Explain which hash algorithm is used in object map.
* Remove mention of chunks in object map in favour of generic
  "additional data".
* Fix indentation in object map documentation.
* Generally clarify object map documentation.
* Fix clippy warnings in Rust code.

brian m.
carlson (15): repository: require Rust support for interoperability conversion: don't crash when no destination algo hash: use uint32_t for object_id algorithm rust: add a ObjectID struct rust: add a hash algorithm abstraction hash: add a function to look up hash algo structs rust: add additional helpers for ObjectID csum-file: define hashwrite's count as a uint32_t write-or-die: add an fsync component for the object map hash: expose hash context functions to Rust rust: add a build.rs script for tests rust: add functionality to hash an object rust: add a new binary object map format rust: add a small wrapper around the hashfile code object-file-convert: always make sure object ID algo is valid Documentation/gitformat-loose.adoc | 78 +++ Makefile | 5 +- build.rs | 17 + csum-file.c | 2 +- csum-file.h | 2 +- hash.c | 48 +- hash.h | 38 +- object-file-convert.c | 14 +- oidtree.c | 2 +- repository.c | 12 +- repository.h | 4 +- serve.c | 2 +- src/csum_file.rs | 81 +++ src/hash.rs | 466 +++++++++++++++ src/lib.rs | 3 + src/loose.rs | 913 +++++++++++++++++++++++++++++ src/meson.build | 3 + t/t1006-cat-file.sh | 82 ++- t/t1016-compatObjectFormat.sh | 6 + t/t1500-rev-parse.sh | 2 +- t/t9305-fast-import-signatures.sh | 4 +- t/t9350-fast-export.sh | 4 +- t/test-lib.sh | 4 + write-or-die.h | 4 +- 24 files changed, 1722 insertions(+), 74 deletions(-) create mode 100644 build.rs create mode 100644 src/csum_file.rs create mode 100644 src/hash.rs create mode 100644 src/loose.rs ^ permalink raw reply [flat|nested] 101+ messages in thread
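Of the listed changes, the hashing-trait split is perhaps the least
obvious; it might look something like this (purely a sketch: only the
names Hasher and CryptoHasher appear in the cover letter, and the
trait name and signatures here are assumed):

    /// Incremental hashing over a byte stream.
    pub trait GitHasher {
        fn update(&mut self, data: &[u8]);
        fn finalize(self) -> Vec<u8>;
    }

    /// Collision-detecting hashing, safe to feed untrusted input.
    pub struct CryptoHasher { /* state elided */ }

    impl GitHasher for CryptoHasher {
        fn update(&mut self, _data: &[u8]) { /* ... */ }
        fn finalize(self) -> Vec<u8> {
            Vec::new() // digest elided in this sketch
        }
    }

    // A faster hasher without collision detection could implement the
    // same trait later for trusted data, without touching call sites.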
* [PATCH v2 01/15] repository: require Rust support for interoperability
  2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson
@ 2025-11-17 22:16   ` brian m. carlson
  2025-11-17 22:16   ` [PATCH v2 02/15] conversion: don't crash when no destination algo brian m. carlson
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 101+ messages in thread
From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

We'll be implementing some of our interoperability code, like the loose
object map, in Rust. While the code currently compiles with the old
loose object map format, which is written entirely in C, we'll soon
replace that with the Rust-based implementation. Require the use of
Rust for compatibility mode and die if it is not supported. Because the
repo argument is not used when Rust is missing, mark it MAYBE_UNUSED to
silence the compiler warning, which we do not care about.

Add a prerequisite in our tests, RUST, that checks if Rust
functionality is available and use it in the tests that handle
interoperability.

This is technically a regression in functionality compared to our
existing state, but pack index v3 is not yet implemented and thus the
functionality is mostly quite broken, which is why we've recently
marked this functionality as experimental. We don't believe anyone is
getting real use out of the interoperability code in its current state,
so no actual users should be negatively impacted by this change.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 repository.c                      |  8 ++-
 t/t1006-cat-file.sh               | 82 +++++++++++++++++++++----------
 t/t1016-compatObjectFormat.sh     |  6 +++
 t/t1500-rev-parse.sh              |  2 +-
 t/t9305-fast-import-signatures.sh |  4 +-
 t/t9350-fast-export.sh            |  4 +-
 t/test-lib.sh                     |  4 ++
 7 files changed, 77 insertions(+), 33 deletions(-)

diff --git a/repository.c b/repository.c
index 6faf5c7398..186d2c1028 100644
--- a/repository.c
+++ b/repository.c
@@ -3,6 +3,7 @@
 #include "repository.h"
 #include "odb.h"
 #include "config.h"
+#include "gettext.h"
 #include "object.h"
 #include "lockfile.h"
 #include "path.h"
@@ -190,13 +191,18 @@ void repo_set_hash_algo(struct repository *repo, int hash_algo)
         repo->hash_algo = &hash_algos[hash_algo];
 }
 
-void repo_set_compat_hash_algo(struct repository *repo, int algo)
+void repo_set_compat_hash_algo(struct repository *repo MAYBE_UNUSED, int algo)
 {
+#ifdef WITH_RUST
         if (hash_algo_by_ptr(repo->hash_algo) == algo)
                 BUG("hash_algo and compat_hash_algo match");
         repo->compat_hash_algo = algo ?
&hash_algos[algo] : NULL; if (repo->compat_hash_algo) repo_read_loose_object_map(repo); +#else + if (algo) + die(_("compatibility hash algorithm support requires Rust")); +#endif } void repo_set_ref_storage_format(struct repository *repo, diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh index 1f61b666a7..29a9503523 100755 --- a/t/t1006-cat-file.sh +++ b/t/t1006-cat-file.sh @@ -241,10 +241,16 @@ hello_content="Hello World" hello_size=$(strlen "$hello_content") hello_oid=$(echo_without_newline "$hello_content" | git hash-object --stdin) -test_expect_success "setup" ' +test_expect_success "setup part 1" ' git config core.repositoryformatversion 1 && - git config extensions.objectformat $test_hash_algo && - git config extensions.compatobjectformat $test_compat_hash_algo && + git config extensions.objectformat $test_hash_algo +' + +test_expect_success RUST 'compat setup' ' + git config extensions.compatobjectformat $test_compat_hash_algo +' + +test_expect_success 'setup part 2' ' echo_without_newline "$hello_content" > hello && git update-index --add hello && echo_without_newline "$hello_content" > "path with spaces" && @@ -273,9 +279,13 @@ run_blob_tests () { ' } -hello_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $hello_oid) run_blob_tests $hello_oid -run_blob_tests $hello_compat_oid + +if test_have_prereq RUST +then + hello_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $hello_oid) + run_blob_tests $hello_compat_oid +fi test_expect_success '--batch-check without %(rest) considers whole line' ' echo "$hello_oid blob $hello_size" >expect && @@ -286,62 +296,76 @@ test_expect_success '--batch-check without %(rest) considers whole line' ' ' tree_oid=$(git write-tree) -tree_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tree_oid) tree_size=$((2 * $(test_oid rawsz) + 13 + 24)) -tree_compat_size=$((2 * $(test_oid --hash=compat rawsz) + 13 + 24)) tree_pretty_content="100644 blob $hello_oid hello${LF}100755 blob $hello_oid path with spaces${LF}" -tree_compat_pretty_content="100644 blob $hello_compat_oid hello${LF}100755 blob $hello_compat_oid path with spaces${LF}" run_tests 'tree' $tree_oid "" $tree_size "" "$tree_pretty_content" -run_tests 'tree' $tree_compat_oid "" $tree_compat_size "" "$tree_compat_pretty_content" run_tests 'blob' "$tree_oid:hello" "100644" $hello_size "" "$hello_content" $hello_oid -run_tests 'blob' "$tree_compat_oid:hello" "100644" $hello_size "" "$hello_content" $hello_compat_oid run_tests 'blob' "$tree_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_oid -run_tests 'blob' "$tree_compat_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_compat_oid + +if test_have_prereq RUST +then + tree_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tree_oid) + tree_compat_size=$((2 * $(test_oid --hash=compat rawsz) + 13 + 24)) + tree_compat_pretty_content="100644 blob $hello_compat_oid hello${LF}100755 blob $hello_compat_oid path with spaces${LF}" + + run_tests 'tree' $tree_compat_oid "" $tree_compat_size "" "$tree_compat_pretty_content" + run_tests 'blob' "$tree_compat_oid:hello" "100644" $hello_size "" "$hello_content" $hello_compat_oid + run_tests 'blob' "$tree_compat_oid:path with spaces" "100755" $hello_size "" "$hello_content" $hello_compat_oid +fi commit_message="Initial commit" commit_oid=$(echo_without_newline "$commit_message" | git commit-tree $tree_oid) -commit_compat_oid=$(git rev-parse 
--output-object-format=$test_compat_hash_algo $commit_oid) commit_size=$(($(test_oid hexsz) + 137)) -commit_compat_size=$(($(test_oid --hash=compat hexsz) + 137)) commit_content="tree $tree_oid author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> $GIT_AUTHOR_DATE committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE $commit_message" -commit_compat_content="tree $tree_compat_oid +run_tests 'commit' $commit_oid "" $commit_size "$commit_content" "$commit_content" + +if test_have_prereq RUST +then + commit_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $commit_oid) + commit_compat_size=$(($(test_oid --hash=compat hexsz) + 137)) + commit_compat_content="tree $tree_compat_oid author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> $GIT_AUTHOR_DATE committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE $commit_message" -run_tests 'commit' $commit_oid "" $commit_size "$commit_content" "$commit_content" -run_tests 'commit' $commit_compat_oid "" $commit_compat_size "$commit_compat_content" "$commit_compat_content" + run_tests 'commit' $commit_compat_oid "" $commit_compat_size "$commit_compat_content" "$commit_compat_content" +fi tag_header_without_oid="type blob tag hellotag tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL>" tag_header_without_timestamp="object $hello_oid $tag_header_without_oid" -tag_compat_header_without_timestamp="object $hello_compat_oid -$tag_header_without_oid" tag_description="This is a tag" tag_content="$tag_header_without_timestamp 0 +0000 -$tag_description" -tag_compat_content="$tag_compat_header_without_timestamp 0 +0000 - $tag_description" tag_oid=$(echo_without_newline "$tag_content" | git hash-object -t tag --stdin -w) tag_size=$(strlen "$tag_content") -tag_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tag_oid) -tag_compat_size=$(strlen "$tag_compat_content") - run_tests 'tag' $tag_oid "" $tag_size "$tag_content" "$tag_content" -run_tests 'tag' $tag_compat_oid "" $tag_compat_size "$tag_compat_content" "$tag_compat_content" + +if test_have_prereq RUST +then + tag_compat_header_without_timestamp="object $hello_compat_oid +$tag_header_without_oid" + tag_compat_content="$tag_compat_header_without_timestamp 0 +0000 + +$tag_description" + + tag_compat_oid=$(git rev-parse --output-object-format=$test_compat_hash_algo $tag_oid) + tag_compat_size=$(strlen "$tag_compat_content") + + run_tests 'tag' $tag_compat_oid "" $tag_compat_size "$tag_compat_content" "$tag_compat_content" +fi test_expect_success "Reach a blob from a tag pointing to it" ' echo_without_newline "$hello_content" >expect && @@ -590,7 +614,8 @@ flush" } batch_tests $hello_oid $tree_oid $tree_size $commit_oid $commit_size "$commit_content" $tag_oid $tag_size "$tag_content" -batch_tests $hello_compat_oid $tree_compat_oid $tree_compat_size $commit_compat_oid $commit_compat_size "$commit_compat_content" $tag_compat_oid $tag_compat_size "$tag_compat_content" + +test_have_prereq RUST && batch_tests $hello_compat_oid $tree_compat_oid $tree_compat_size $commit_compat_oid $commit_compat_size "$commit_compat_content" $tag_compat_oid $tag_compat_size "$tag_compat_content" test_expect_success FUNNYNAMES 'setup with newline in input' ' @@ -1226,7 +1251,10 @@ test_expect_success 'batch-check with a submodule' ' test_unconfig extensions.compatobjectformat && printf "160000 commit $(test_oid deadbeef)\tsub\n" >tree-with-sub && tree=$(git mktree <tree-with-sub) && - test_config extensions.compatobjectformat $test_compat_hash_algo && + if test_have_prereq RUST + 
then + test_config extensions.compatobjectformat $test_compat_hash_algo + fi && git cat-file --batch-check >actual <<-EOF && $tree:sub diff --git a/t/t1016-compatObjectFormat.sh b/t/t1016-compatObjectFormat.sh index 0efce53f3a..92d48b96a1 100755 --- a/t/t1016-compatObjectFormat.sh +++ b/t/t1016-compatObjectFormat.sh @@ -8,6 +8,12 @@ test_description='Test how well compatObjectFormat works' . ./test-lib.sh . "$TEST_DIRECTORY"/lib-gpg.sh +if ! test_have_prereq RUST +then + skip_all='interoperability requires a Git built with Rust' + test_done +fi + # All of the follow variables must be defined in the environment: # GIT_AUTHOR_NAME # GIT_AUTHOR_EMAIL diff --git a/t/t1500-rev-parse.sh b/t/t1500-rev-parse.sh index 7739ab611b..98c5a772bd 100755 --- a/t/t1500-rev-parse.sh +++ b/t/t1500-rev-parse.sh @@ -208,7 +208,7 @@ test_expect_success 'rev-parse --show-object-format in repo' ' ' -test_expect_success 'rev-parse --show-object-format in repo with compat mode' ' +test_expect_success RUST 'rev-parse --show-object-format in repo with compat mode' ' mkdir repo && ( sane_unset GIT_DEFAULT_HASH && diff --git a/t/t9305-fast-import-signatures.sh b/t/t9305-fast-import-signatures.sh index c2b4271658..63c0a2b5c4 100755 --- a/t/t9305-fast-import-signatures.sh +++ b/t/t9305-fast-import-signatures.sh @@ -70,7 +70,7 @@ test_expect_success GPGSSH 'strip SSH signature with --signed-commits=strip' ' test_must_be_empty log ' -test_expect_success GPG 'setup a commit with dual OpenPGP signatures on its SHA-1 and SHA-256 formats' ' +test_expect_success RUST,GPG 'setup a commit with dual OpenPGP signatures on its SHA-1 and SHA-256 formats' ' # Create a signed SHA-256 commit git init --object-format=sha256 explicit-sha256 && git -C explicit-sha256 config extensions.compatObjectFormat sha1 && @@ -91,7 +91,7 @@ test_expect_success GPG 'setup a commit with dual OpenPGP signatures on its SHA- test_grep -E "^gpgsig-sha256 " out ' -test_expect_success GPG 'strip both OpenPGP signatures with --signed-commits=warn-strip' ' +test_expect_success RUST,GPG 'strip both OpenPGP signatures with --signed-commits=warn-strip' ' git -C explicit-sha256 fast-export --signed-commits=verbatim dual-signed >output && test_grep -E "^gpgsig sha1 openpgp" output && test_grep -E "^gpgsig sha256 openpgp" output && diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh index 3d153a4805..784d68b6e5 100755 --- a/t/t9350-fast-export.sh +++ b/t/t9350-fast-export.sh @@ -972,7 +972,7 @@ test_expect_success 'fast-export handles --end-of-options' ' test_cmp expect actual ' -test_expect_success GPG 'setup a commit with dual signatures on its SHA-1 and SHA-256 formats' ' +test_expect_success GPG,RUST 'setup a commit with dual signatures on its SHA-1 and SHA-256 formats' ' # Create a signed SHA-256 commit git init --object-format=sha256 explicit-sha256 && git -C explicit-sha256 config extensions.compatObjectFormat sha1 && @@ -993,7 +993,7 @@ test_expect_success GPG 'setup a commit with dual signatures on its SHA-1 and SH test_grep -E "^gpgsig-sha256 " out ' -test_expect_success GPG 'export and import of doubly signed commit' ' +test_expect_success GPG,RUST 'export and import of doubly signed commit' ' git -C explicit-sha256 fast-export --signed-commits=verbatim dual-signed >output && test_grep -E "^gpgsig sha1 openpgp" output && test_grep -E "^gpgsig sha256 openpgp" output && diff --git a/t/test-lib.sh b/t/test-lib.sh index ef0ab7ec2d..3499a83806 100644 --- a/t/test-lib.sh +++ b/t/test-lib.sh @@ -1890,6 +1890,10 @@ test_lazy_prereq LONG_IS_64BIT ' test 8 
-le "$(build_option sizeof-long)" ' +test_lazy_prereq RUST ' + test "$(build_option rust)" = enabled +' + test_lazy_prereq TIME_IS_64BIT 'test-tool date is64bit' test_lazy_prereq TIME_T_IS_64BIT 'test-tool date time_t-is64bit' ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 02/15] conversion: don't crash when no destination algo 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson 2025-11-17 22:16 ` [PATCH v2 01/15] repository: require Rust support for interoperability brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 03/15] hash: use uint32_t for object_id algorithm brian m. carlson ` (12 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren When we set up a repository that doesn't have a compatibility hash algorithm, we set the destination algorithm object to NULL. In such a case, we want to silently do nothing instead of crashing, so simply treat the operation as a no-op and copy the object ID. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- object-file-convert.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/object-file-convert.c b/object-file-convert.c index 7ab875afe6..e44c821084 100644 --- a/object-file-convert.c +++ b/object-file-convert.c @@ -23,7 +23,7 @@ int repo_oid_to_algop(struct repository *repo, const struct object_id *src, const struct git_hash_algo *from = src->algo ? &hash_algos[src->algo] : repo->hash_algo; - if (from == to) { + if (from == to || !to) { if (src != dest) oidcpy(dest, src); return 0; ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 03/15] hash: use uint32_t for object_id algorithm 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson 2025-11-17 22:16 ` [PATCH v2 01/15] repository: require Rust support for interoperability brian m. carlson 2025-11-17 22:16 ` [PATCH v2 02/15] conversion: don't crash when no destination algo brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 04/15] rust: add a ObjectID struct brian m. carlson ` (11 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We currently use an int for this value, but we'll define this structure from Rust in a future commit and we want to ensure that our data types are exactly identical. To make that possible, use a uint32_t for the hash algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- hash.c | 6 +++--- hash.h | 10 +++++----- oidtree.c | 2 +- repository.c | 6 +++--- repository.h | 4 ++-- serve.c | 2 +- 6 files changed, 15 insertions(+), 15 deletions(-) diff --git a/hash.c b/hash.c index 4a04ecb50e..81b4f87027 100644 --- a/hash.c +++ b/hash.c @@ -241,7 +241,7 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop) return oid_to_hex_r(buf, algop->empty_tree); } -int hash_algo_by_name(const char *name) +uint32_t hash_algo_by_name(const char *name) { if (!name) return GIT_HASH_UNKNOWN; @@ -251,7 +251,7 @@ int hash_algo_by_name(const char *name) return GIT_HASH_UNKNOWN; } -int hash_algo_by_id(uint32_t format_id) +uint32_t hash_algo_by_id(uint32_t format_id) { for (size_t i = 1; i < GIT_HASH_NALGOS; i++) if (format_id == hash_algos[i].format_id) @@ -259,7 +259,7 @@ int hash_algo_by_id(uint32_t format_id) return GIT_HASH_UNKNOWN; } -int hash_algo_by_length(size_t len) +uint32_t hash_algo_by_length(size_t len) { for (size_t i = 1; i < GIT_HASH_NALGOS; i++) if (len == hash_algos[i].rawsz) diff --git a/hash.h b/hash.h index fae966b23c..99c9c2a0a8 100644 --- a/hash.h +++ b/hash.h @@ -211,7 +211,7 @@ static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *s struct object_id { unsigned char hash[GIT_MAX_RAWSZ]; - int algo; /* XXX requires 4-byte alignment */ + uint32_t algo; /* XXX requires 4-byte alignment */ }; #define GET_OID_QUIETLY 01 @@ -344,13 +344,13 @@ static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if * the name doesn't match a known algorithm. */ -int hash_algo_by_name(const char *name); +uint32_t hash_algo_by_name(const char *name); /* Identical, except based on the format ID. */ -int hash_algo_by_id(uint32_t format_id); +uint32_t hash_algo_by_id(uint32_t format_id); /* Identical, except based on the length. */ -int hash_algo_by_length(size_t len); +uint32_t hash_algo_by_length(size_t len); /* Identical, except for a pointer to struct git_hash_algo. 
*/ -static inline int hash_algo_by_ptr(const struct git_hash_algo *p) +static inline uint32_t hash_algo_by_ptr(const struct git_hash_algo *p) { size_t i; for (i = 0; i < GIT_HASH_NALGOS; i++) { diff --git a/oidtree.c b/oidtree.c index 151568f74f..324de94934 100644 --- a/oidtree.c +++ b/oidtree.c @@ -10,7 +10,7 @@ struct oidtree_iter_data { oidtree_iter fn; void *arg; size_t *last_nibble_at; - int algo; + uint32_t algo; uint8_t last_byte; }; diff --git a/repository.c b/repository.c index 186d2c1028..ebe719de3c 100644 --- a/repository.c +++ b/repository.c @@ -39,7 +39,7 @@ struct repository *the_repository = &the_repo; static void set_default_hash_algo(struct repository *repo) { const char *hash_name; - int algo; + uint32_t algo; hash_name = getenv("GIT_TEST_DEFAULT_HASH_ALGO"); if (!hash_name) @@ -186,12 +186,12 @@ void repo_set_gitdir(struct repository *repo, repo->gitdir, "index"); } -void repo_set_hash_algo(struct repository *repo, int hash_algo) +void repo_set_hash_algo(struct repository *repo, uint32_t hash_algo) { repo->hash_algo = &hash_algos[hash_algo]; } -void repo_set_compat_hash_algo(struct repository *repo MAYBE_UNUSED, int algo) +void repo_set_compat_hash_algo(struct repository *repo MAYBE_UNUSED, uint32_t algo) { #ifdef WITH_RUST if (hash_algo_by_ptr(repo->hash_algo) == algo) diff --git a/repository.h b/repository.h index 5808a5d610..c0a3543b24 100644 --- a/repository.h +++ b/repository.h @@ -193,8 +193,8 @@ struct set_gitdir_args { void repo_set_gitdir(struct repository *repo, const char *root, const struct set_gitdir_args *extra_args); void repo_set_worktree(struct repository *repo, const char *path); -void repo_set_hash_algo(struct repository *repo, int algo); -void repo_set_compat_hash_algo(struct repository *repo, int compat_algo); +void repo_set_hash_algo(struct repository *repo, uint32_t algo); +void repo_set_compat_hash_algo(struct repository *repo, uint32_t compat_algo); void repo_set_ref_storage_format(struct repository *repo, enum ref_storage_format format); void initialize_repository(struct repository *repo); diff --git a/serve.c b/serve.c index 53ecab3b42..49a6e39b1d 100644 --- a/serve.c +++ b/serve.c @@ -14,7 +14,7 @@ static int advertise_sid = -1; static int advertise_object_info = -1; -static int client_hash_algo = GIT_HASH_SHA1_LEGACY; +static uint32_t client_hash_algo = GIT_HASH_SHA1_LEGACY; static int always_advertise(struct repository *r UNUSED, struct strbuf *value UNUSED) ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 04/15] rust: add a ObjectID struct 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (2 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 03/15] hash: use uint32_t for object_id algorithm brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 05/15] rust: add a hash algorithm abstraction brian m. carlson ` (10 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We'd like to be able to write some Rust code that can work with object IDs. Add a structure here that's identical to struct object_id in C, for easy use in sharing across the FFI boundary. We will use this structure in several places in hot paths, such as index-pack or pack-objects when converting between algorithms, so prioritize efficient interchange over a more idiomatic Rust approach. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Makefile | 1 + src/hash.rs | 21 +++++++++++++++++++++ src/lib.rs | 1 + src/meson.build | 1 + 4 files changed, 24 insertions(+) create mode 100644 src/hash.rs diff --git a/Makefile b/Makefile index 7e0f77e298..e1d0ae3691 100644 --- a/Makefile +++ b/Makefile @@ -1534,6 +1534,7 @@ CLAR_TEST_OBJS += $(UNIT_TEST_DIR)/unit-test.o UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o +RUST_SOURCES += src/hash.rs RUST_SOURCES += src/lib.rs RUST_SOURCES += src/varint.rs diff --git a/src/hash.rs b/src/hash.rs new file mode 100644 index 0000000000..0219391820 --- /dev/null +++ b/src/hash.rs @@ -0,0 +1,21 @@ +// This program is free software; you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation: version 2 of the License, dated June 1991. +// +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License along +// with this program; if not, see <https://www.gnu.org/licenses/>. + +pub const GIT_MAX_RAWSZ: usize = 32; + +/// A binary object ID. +#[repr(C)] +#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)] +pub struct ObjectID { + pub hash: [u8; GIT_MAX_RAWSZ], + pub algo: u32, +} diff --git a/src/lib.rs b/src/lib.rs index 9da70d8b57..cf7c962509 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1 +1,2 @@ +pub mod hash; pub mod varint; diff --git a/src/meson.build b/src/meson.build index 25b9ad5a14..c77041a3fa 100644 --- a/src/meson.build +++ b/src/meson.build @@ -1,4 +1,5 @@ libgit_rs_sources = [ + 'hash.rs', 'lib.rs', 'varint.rs', ] ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 05/15] rust: add a hash algorithm abstraction 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (3 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 04/15] rust: add a ObjectID struct brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 06/15] hash: add a function to look up hash algo structs brian m. carlson ` (9 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren This works very similarly to the existing one in C except that it doesn't provide any functionality to hash an object. We don't currently need that right now, but the use of those function pointers do make it substantially more difficult to write a bit-for-bit identical structure across the C/Rust interface, so omit them for now. Instead of the more customary "&self", use "self", because the former is the size of a pointer and the latter is the size of an integer on most systems. Don't define an unknown value but use an Option for that instead. Update the object ID structure to allow slicing the data appropriately for the algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- src/hash.rs | 159 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 159 insertions(+) diff --git a/src/hash.rs b/src/hash.rs index 0219391820..0ec0ab0490 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -10,8 +10,25 @@ // You should have received a copy of the GNU General Public License along // with this program; if not, see <https://www.gnu.org/licenses/>. +use std::error::Error; +use std::fmt::{self, Debug, Display}; + pub const GIT_MAX_RAWSZ: usize = 32; +/// An error indicating an invalid hash algorithm. +/// +/// The contained `u32` is the same as the `algo` field in `ObjectID`. +#[derive(Debug, Copy, Clone)] +pub struct InvalidHashAlgorithm(pub u32); + +impl Display for InvalidHashAlgorithm { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + write!(f, "invalid hash algorithm {}", self.0) + } +} + +impl Error for InvalidHashAlgorithm {} + /// A binary object ID. 
#[repr(C)] #[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)] @@ -19,3 +36,145 @@ pub struct ObjectID { pub hash: [u8; GIT_MAX_RAWSZ], pub algo: u32, } + +#[allow(dead_code)] +impl ObjectID { + pub fn as_slice(&self) -> Result<&[u8], InvalidHashAlgorithm> { + match HashAlgorithm::from_u32(self.algo) { + Some(algo) => Ok(&self.hash[0..algo.raw_len()]), + None => Err(InvalidHashAlgorithm(self.algo)), + } + } + + pub fn as_mut_slice(&mut self) -> Result<&mut [u8], InvalidHashAlgorithm> { + match HashAlgorithm::from_u32(self.algo) { + Some(algo) => Ok(&mut self.hash[0..algo.raw_len()]), + None => Err(InvalidHashAlgorithm(self.algo)), + } + } +} + +/// A hash algorithm, +#[repr(C)] +#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] +pub enum HashAlgorithm { + SHA1 = 1, + SHA256 = 2, +} + +#[allow(dead_code)] +impl HashAlgorithm { + const SHA1_NULL_OID: ObjectID = ObjectID { + hash: [0u8; 32], + algo: Self::SHA1 as u32, + }; + const SHA256_NULL_OID: ObjectID = ObjectID { + hash: [0u8; 32], + algo: Self::SHA256 as u32, + }; + + const SHA1_EMPTY_TREE: ObjectID = ObjectID { + hash: *b"\x4b\x82\x5d\xc6\x42\xcb\x6e\xb9\xa0\x60\xe5\x4b\xf8\xd6\x92\x88\xfb\xee\x49\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + algo: Self::SHA1 as u32, + }; + const SHA256_EMPTY_TREE: ObjectID = ObjectID { + hash: *b"\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1\x04\xd4\x5d\x8d\x85\xef\xa9\xb0\x57\xb5\x3b\x14\xb4\xb9\xb9\x39\xdd\x74\xde\xcc\x53\x21", + algo: Self::SHA256 as u32, + }; + + const SHA1_EMPTY_BLOB: ObjectID = ObjectID { + hash: *b"\xe6\x9d\xe2\x9b\xb2\xd1\xd6\x43\x4b\x8b\x29\xae\x77\x5a\xd8\xc2\xe4\x8c\x53\x91\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + algo: Self::SHA1 as u32, + }; + const SHA256_EMPTY_BLOB: ObjectID = ObjectID { + hash: *b"\x47\x3a\x0f\x4c\x3b\xe8\xa9\x36\x81\xa2\x67\xe3\xb1\xe9\xa7\xdc\xda\x11\x85\x43\x6f\xe1\x41\xf7\x74\x91\x20\xa3\x03\x72\x18\x13", + algo: Self::SHA256 as u32, + }; + + /// Return a hash algorithm based on the internal integer ID used by Git. + /// + /// Returns `None` if the algorithm doesn't indicate a valid algorithm. + pub const fn from_u32(algo: u32) -> Option<HashAlgorithm> { + match algo { + 1 => Some(HashAlgorithm::SHA1), + 2 => Some(HashAlgorithm::SHA256), + _ => None, + } + } + + /// Return a hash algorithm based on the internal integer ID used by Git. + /// + /// Returns `None` if the algorithm doesn't indicate a valid algorithm. + pub const fn from_format_id(algo: u32) -> Option<HashAlgorithm> { + match algo { + 0x73686131 => Some(HashAlgorithm::SHA1), + 0x73323536 => Some(HashAlgorithm::SHA256), + _ => None, + } + } + + /// The name of this hash algorithm as a string suitable for the configuration file. + pub const fn name(self) -> &'static str { + match self { + HashAlgorithm::SHA1 => "sha1", + HashAlgorithm::SHA256 => "sha256", + } + } + + /// The format ID of this algorithm for binary formats. + /// + /// Note that when writing this to a data format, it should be written in big-endian format + /// explicitly. + pub const fn format_id(self) -> u32 { + match self { + HashAlgorithm::SHA1 => 0x73686131, + HashAlgorithm::SHA256 => 0x73323536, + } + } + + /// The length of binary object IDs in this algorithm in bytes. + pub const fn raw_len(self) -> usize { + match self { + HashAlgorithm::SHA1 => 20, + HashAlgorithm::SHA256 => 32, + } + } + + /// The length of object IDs in this algorithm in hexadecimal characters. 
+ pub const fn hex_len(self) -> usize { + self.raw_len() * 2 + } + + /// The number of bytes which is processed by one iteration of this algorithm's compression + /// function. + pub const fn block_size(self) -> usize { + match self { + HashAlgorithm::SHA1 => 64, + HashAlgorithm::SHA256 => 64, + } + } + + /// The object ID representing the empty blob. + pub const fn empty_blob(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_EMPTY_BLOB, + HashAlgorithm::SHA256 => &Self::SHA256_EMPTY_BLOB, + } + } + + /// The object ID representing the empty tree. + pub const fn empty_tree(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_EMPTY_TREE, + HashAlgorithm::SHA256 => &Self::SHA256_EMPTY_TREE, + } + } + + /// The object ID which is all zeros. + pub const fn null_oid(self) -> &'static ObjectID { + match self { + HashAlgorithm::SHA1 => &Self::SHA1_NULL_OID, + HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, + } + } +} ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 06/15] hash: add a function to look up hash algo structs 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (4 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 05/15] rust: add a hash algorithm abstraction brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 07/15] rust: add additional helpers for ObjectID brian m. carlson ` (8 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren In C, it's easy for us to look up a hash algorithm structure by its offset by simply indexing the hash_algos array. However, in Rust, we sometimes need a pointer to pass to a C function, but we have our own hash algorithm abstraction. To get one from the other, let's provide a simple function that looks up the C structure from the offset and expose it in Rust. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- hash.c | 7 +++++++ hash.h | 1 + src/hash.rs | 14 ++++++++++++++ 3 files changed, 22 insertions(+) diff --git a/hash.c b/hash.c index 81b4f87027..97fd473607 100644 --- a/hash.c +++ b/hash.c @@ -241,6 +241,13 @@ const char *empty_tree_oid_hex(const struct git_hash_algo *algop) return oid_to_hex_r(buf, algop->empty_tree); } +const struct git_hash_algo *hash_algo_ptr_by_number(uint32_t algo) +{ + if (algo >= GIT_HASH_NALGOS) + return NULL; + return &hash_algos[algo]; +} + uint32_t hash_algo_by_name(const char *name) { if (!name) diff --git a/hash.h b/hash.h index 99c9c2a0a8..709d7585a5 100644 --- a/hash.h +++ b/hash.h @@ -340,6 +340,7 @@ static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx ctx->algop->final_oid_fn(oid, ctx); } +const struct git_hash_algo *hash_algo_ptr_by_number(uint32_t algo); /* * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if * the name doesn't match a known algorithm. diff --git a/src/hash.rs b/src/hash.rs index 0ec0ab0490..70bb8095e8 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -12,6 +12,7 @@ use std::error::Error; use std::fmt::{self, Debug, Display}; +use std::os::raw::c_void; pub const GIT_MAX_RAWSZ: usize = 32; @@ -177,4 +178,17 @@ impl HashAlgorithm { HashAlgorithm::SHA256 => &Self::SHA256_NULL_OID, } } + + /// A pointer to the C `struct git_hash_algo` for interoperability with C. + pub fn hash_algo_ptr(self) -> *const c_void { + unsafe { c::hash_algo_ptr_by_number(self as u32) } + } +} + +pub mod c { + use std::os::raw::c_void; + + extern "C" { + pub fn hash_algo_ptr_by_number(n: u32) -> *const c_void; + } } ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 07/15] rust: add additional helpers for ObjectID 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (5 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 06/15] hash: add a function to look up hash algo structs brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 08/15] csum-file: define hashwrite's count as a uint32_t brian m. carlson ` (7 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren Right now, users can internally access the contents of the ObjectID struct, which can lead to data that is not valid, such as invalid algorithms or non-zero-padded hash values. These can cause problems down the line as we use them more. Add a constructor for ObjectID that allows us to set these values and also provide an accessor for the algorithm so that we can access it. In addition, provide useful Display and Debug implementations that can format our data in a useful way. Now that we have the ability to work with these various components in a nice way, add some tests as well to make sure that ObjectID and HashAlgorithm work together as expected. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- src/hash.rs | 133 +++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 132 insertions(+), 1 deletion(-) diff --git a/src/hash.rs b/src/hash.rs index 70bb8095e8..e1fa568661 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -32,7 +32,7 @@ impl Error for InvalidHashAlgorithm {} /// A binary object ID. #[repr(C)] -#[derive(Debug, Clone, Ord, PartialOrd, Eq, PartialEq)] +#[derive(Clone, Ord, PartialOrd, Eq, PartialEq)] pub struct ObjectID { pub hash: [u8; GIT_MAX_RAWSZ], pub algo: u32, @@ -40,6 +40,27 @@ pub struct ObjectID { #[allow(dead_code)] impl ObjectID { + /// Return a new object ID with the given algorithm and hash. + /// + /// `hash` must be exactly the proper length for `algo` and this function panics if it is not. + /// The extra internal storage of `hash`, if any, is zero filled. + pub fn new(algo: HashAlgorithm, hash: &[u8]) -> Self { + let mut data = [0u8; GIT_MAX_RAWSZ]; + // This verifies that the length of `hash` is correct. + data[0..algo.raw_len()].copy_from_slice(hash); + Self { + hash: data, + algo: algo as u32, + } + } + + /// Return the algorithm for this object ID. + /// + /// If the algorithm set internally is not valid, this function panics. + pub fn algo(&self) -> Result<HashAlgorithm, InvalidHashAlgorithm> { + HashAlgorithm::from_u32(self.algo).ok_or(InvalidHashAlgorithm(self.algo)) + } + pub fn as_slice(&self) -> Result<&[u8], InvalidHashAlgorithm> { match HashAlgorithm::from_u32(self.algo) { Some(algo) => Ok(&self.hash[0..algo.raw_len()]), @@ -55,6 +76,41 @@ impl ObjectID { } } +impl Display for ObjectID { + /// Format this object ID as a hex object ID. + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + let hash = self.as_slice().unwrap(); + for x in hash { + write!(f, "{:02x}", x)?; + } + Ok(()) + } +} + +impl Debug for ObjectID { + /// Format this object ID as a hex object ID with a colon and name appended to it. 
+ /// + /// ``` + /// assert_eq!( + /// format!("{:?}", HashAlgorithm::SHA256.null_oid()), + /// "0000000000000000000000000000000000000000000000000000000000000000:sha256" + /// ); + /// ``` + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + let hash = match self.as_slice() { + Ok(hash) => hash, + Err(_) => &self.hash, + }; + for x in hash { + write!(f, "{:02x}", x)?; + } + match self.algo() { + Ok(algo) => write!(f, ":{}", algo.name()), + Err(e) => write!(f, ":invalid-hash-algo-{}", e.0), + } + } +} + /// A hash algorithm, #[repr(C)] #[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] @@ -192,3 +248,78 @@ pub mod c { pub fn hash_algo_ptr_by_number(n: u32) -> *const c_void; } } + +#[cfg(test)] +mod tests { + use super::HashAlgorithm; + + fn all_algos() -> &'static [HashAlgorithm] { + &[HashAlgorithm::SHA1, HashAlgorithm::SHA256] + } + + #[test] + fn format_id_round_trips() { + for algo in all_algos() { + assert_eq!( + *algo, + HashAlgorithm::from_format_id(algo.format_id()).unwrap() + ); + } + } + + #[test] + fn offset_round_trips() { + for algo in all_algos() { + assert_eq!(*algo, HashAlgorithm::from_u32(*algo as u32).unwrap()); + } + } + + #[test] + fn slices_have_correct_length() { + for algo in all_algos() { + for oid in [algo.null_oid(), algo.empty_blob(), algo.empty_tree()] { + assert_eq!(oid.as_slice().unwrap().len(), algo.raw_len()); + } + } + } + + #[test] + fn object_ids_format_correctly() { + let entries = &[ + ( + HashAlgorithm::SHA1.null_oid(), + "0000000000000000000000000000000000000000", + "0000000000000000000000000000000000000000:sha1", + ), + ( + HashAlgorithm::SHA1.empty_blob(), + "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", + "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391:sha1", + ), + ( + HashAlgorithm::SHA1.empty_tree(), + "4b825dc642cb6eb9a060e54bf8d69288fbee4904", + "4b825dc642cb6eb9a060e54bf8d69288fbee4904:sha1", + ), + ( + HashAlgorithm::SHA256.null_oid(), + "0000000000000000000000000000000000000000000000000000000000000000", + "0000000000000000000000000000000000000000000000000000000000000000:sha256", + ), + ( + HashAlgorithm::SHA256.empty_blob(), + "473a0f4c3be8a93681a267e3b1e9a7dcda1185436fe141f7749120a303721813", + "473a0f4c3be8a93681a267e3b1e9a7dcda1185436fe141f7749120a303721813:sha256", + ), + ( + HashAlgorithm::SHA256.empty_tree(), + "6ef19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321", + "6ef19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321:sha256", + ), + ]; + for (oid, display, debug) in entries { + assert_eq!(format!("{}", oid), *display); + assert_eq!(format!("{:?}", oid), *debug); + } + } +} ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 08/15] csum-file: define hashwrite's count as a uint32_t 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (6 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 07/15] rust: add additional helpers for ObjectID brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 09/15] write-or-die: add an fsync component for the object map brian m. carlson ` (6 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We want to call this code from Rust and ensure that the types are the same for compatibility, which is easiest to do if the type is a fixed size. Since unsigned int is 32 bits on all the platforms we care about, define it as a uint32_t instead. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- csum-file.c | 2 +- csum-file.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/csum-file.c b/csum-file.c index 6e21e3cac8..3d3047c776 100644 --- a/csum-file.c +++ b/csum-file.c @@ -110,7 +110,7 @@ void discard_hashfile(struct hashfile *f) free_hashfile(f); } -void hashwrite(struct hashfile *f, const void *buf, unsigned int count) +void hashwrite(struct hashfile *f, const void *buf, uint32_t count) { while (count) { unsigned left = f->buffer_len - f->offset; diff --git a/csum-file.h b/csum-file.h index 07ae11024a..ecce9d27b0 100644 --- a/csum-file.h +++ b/csum-file.h @@ -63,7 +63,7 @@ void free_hashfile(struct hashfile *f); */ int finalize_hashfile(struct hashfile *, unsigned char *, enum fsync_component, unsigned int); void discard_hashfile(struct hashfile *); -void hashwrite(struct hashfile *, const void *, unsigned int); +void hashwrite(struct hashfile *, const void *, uint32_t); void hashflush(struct hashfile *f); void crc32_begin(struct hashfile *); uint32_t crc32_end(struct hashfile *); ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 09/15] write-or-die: add an fsync component for the object map 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (7 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 08/15] csum-file: define hashwrite's count as a uint32_t brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 10/15] hash: expose hash context functions to Rust brian m. carlson ` (5 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We'll soon be writing out an object map using the hashfile code. Add an fsync component to allow us to handle fsyncing it correctly. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- write-or-die.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/write-or-die.h b/write-or-die.h index 65a5c42a47..ff0408bd84 100644 --- a/write-or-die.h +++ b/write-or-die.h @@ -21,6 +21,7 @@ enum fsync_component { FSYNC_COMPONENT_COMMIT_GRAPH = 1 << 3, FSYNC_COMPONENT_INDEX = 1 << 4, FSYNC_COMPONENT_REFERENCE = 1 << 5, + FSYNC_COMPONENT_OBJECT_MAP = 1 << 6, }; #define FSYNC_COMPONENTS_OBJECTS (FSYNC_COMPONENT_LOOSE_OBJECT | \ @@ -44,7 +45,8 @@ enum fsync_component { FSYNC_COMPONENT_PACK_METADATA | \ FSYNC_COMPONENT_COMMIT_GRAPH | \ FSYNC_COMPONENT_INDEX | \ - FSYNC_COMPONENT_REFERENCE) + FSYNC_COMPONENT_REFERENCE | \ + FSYNC_COMPONENT_OBJECT_MAP) #ifndef FSYNC_COMPONENTS_PLATFORM_DEFAULT #define FSYNC_COMPONENTS_PLATFORM_DEFAULT FSYNC_COMPONENTS_DEFAULT ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 10/15] hash: expose hash context functions to Rust 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (8 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 09/15] write-or-die: add an fsync component for the object map brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 11/15] rust: add a build.rs script for tests brian m. carlson ` (4 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren We'd like to be able to hash our data in Rust using the same contexts as in C. However, we need our helper functions to not be inline so they can be linked into the binary appropriately. In addition, to avoid managing memory manually and since we don't know the size of the hash context structure, we want to have simple alloc and free functions we can use to make sure a context can be easily dynamically created. Expose the helper functions and create alloc, free, and init functions we can call. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- hash.c | 35 +++++++++++++++++++++++++++++++++++ hash.h | 27 +++++++-------------------- 2 files changed, 42 insertions(+), 20 deletions(-) diff --git a/hash.c b/hash.c index 97fd473607..553f2008ea 100644 --- a/hash.c +++ b/hash.c @@ -248,6 +248,41 @@ const struct git_hash_algo *hash_algo_ptr_by_number(uint32_t algo) return &hash_algos[algo]; } +struct git_hash_ctx *git_hash_alloc(void) +{ + return xmalloc(sizeof(struct git_hash_ctx)); +} + +void git_hash_free(struct git_hash_ctx *ctx) +{ + free(ctx); +} + +void git_hash_init(struct git_hash_ctx *ctx, const struct git_hash_algo *algop) +{ + algop->init_fn(ctx); +} + +void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src) +{ + src->algop->clone_fn(dst, src); +} + +void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len) +{ + ctx->algop->update_fn(ctx, in, len); +} + +void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx) +{ + ctx->algop->final_fn(hash, ctx); +} + +void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx) +{ + ctx->algop->final_oid_fn(oid, ctx); +} + uint32_t hash_algo_by_name(const char *name) { if (!name) diff --git a/hash.h b/hash.h index 709d7585a5..d51efce1d3 100644 --- a/hash.h +++ b/hash.h @@ -320,27 +320,14 @@ struct git_hash_algo { }; extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS]; -static inline void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src) -{ - src->algop->clone_fn(dst, src); -} - -static inline void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len) -{ - ctx->algop->update_fn(ctx, in, len); -} - -static inline void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx) -{ - ctx->algop->final_fn(hash, ctx); -} - -static inline void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx) -{ - ctx->algop->final_oid_fn(oid, ctx); -} - +void git_hash_init(struct git_hash_ctx *ctx, const struct git_hash_algo *algop); +void git_hash_clone(struct git_hash_ctx *dst, const struct git_hash_ctx *src); +void git_hash_update(struct git_hash_ctx *ctx, const void *in, size_t len); +void git_hash_final(unsigned char *hash, struct git_hash_ctx *ctx); +void git_hash_final_oid(struct object_id *oid, struct git_hash_ctx *ctx); const struct git_hash_algo *hash_algo_ptr_by_number(uint32_t algo); +struct git_hash_ctx *git_hash_alloc(void); +void 
git_hash_free(struct git_hash_ctx *ctx); /* * Return a GIT_HASH_* constant based on the name. Returns GIT_HASH_UNKNOWN if * the name doesn't match a known algorithm. ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 11/15] rust: add a build.rs script for tests 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (9 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 10/15] hash: expose hash context functions to Rust brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 12/15] rust: add functionality to hash an object brian m. carlson ` (3 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren Cargo uses the build.rs script to determine how to compile and link a binary. The only binary we're generating, however, is for our tests, but in a future commit, we're going to link against libgit.a for some functionality and we'll need to make sure the test binaries are complete. Add a build.rs file for this case and specify the files we're going to be linking against. Because we cannot specify different dependencies when building our static library versus our tests, update the Makefile to specify these dependencies for our static library to avoid race conditions during build. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Makefile | 2 +- build.rs | 17 +++++++++++++++++ 2 files changed, 18 insertions(+), 1 deletion(-) create mode 100644 build.rs diff --git a/Makefile b/Makefile index e1d0ae3691..4211d7622a 100644 --- a/Makefile +++ b/Makefile @@ -2964,7 +2964,7 @@ scalar$X: scalar.o GIT-LDFLAGS $(GITLIBS) $(LIB_FILE): $(LIB_OBJS) $(QUIET_AR)$(RM) $@ && $(AR) $(ARFLAGS) $@ $^ -$(RUST_LIB): Cargo.toml $(RUST_SOURCES) +$(RUST_LIB): Cargo.toml $(RUST_SOURCES) $(LIB_FILE) $(QUIET_CARGO)cargo build $(CARGO_ARGS) .PHONY: rust diff --git a/build.rs b/build.rs new file mode 100644 index 0000000000..3724b3a930 --- /dev/null +++ b/build.rs @@ -0,0 +1,17 @@ +// This program is free software; you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation: version 2 of the License, dated June 1991. +// +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License along +// with this program; if not, see <https://www.gnu.org/licenses/>. + +fn main() { + println!("cargo:rustc-link-search=."); + println!("cargo:rustc-link-lib=git"); + println!("cargo:rustc-link-lib=z"); +} ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 12/15] rust: add functionality to hash an object 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (10 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 11/15] rust: add a build.rs script for tests brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 13/15] rust: add a new binary object map format brian m. carlson ` (2 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren In a future commit, we'll want to hash some data when dealing with an object map. Let's make this easy by creating a structure to hash objects and calling into the C functions as necessary to perform the hashing. For now, we only implement safe hashing, but in the future we could add unsafe hashing if we want. Implement Clone and Drop to appropriately manage our memory. Additionally implement Write to make it easy to use with other formats that implement this trait. While we're at it, add some tests for the various hashing cases. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- src/hash.rs | 143 +++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 142 insertions(+), 1 deletion(-) diff --git a/src/hash.rs b/src/hash.rs index e1fa568661..dea2998de4 100644 --- a/src/hash.rs +++ b/src/hash.rs @@ -12,6 +12,7 @@ use std::error::Error; use std::fmt::{self, Debug, Display}; +use std::io::{self, Write}; use std::os::raw::c_void; pub const GIT_MAX_RAWSZ: usize = 32; @@ -111,6 +112,100 @@ impl Debug for ObjectID { } } +/// A trait to implement hashing with a cryptographic algorithm. +pub trait CryptoDigest { + /// Return true if this digest is safe for use with untrusted data, false otherwise. + fn is_safe(&self) -> bool; + + /// Update the digest with the specified data. + fn update(&mut self, data: &[u8]); + + /// Return an object ID, consuming the hasher. + fn into_oid(self) -> ObjectID; + + /// Return a hash as a `Vec`, consuming the hasher. + fn into_vec(self) -> Vec<u8>; +} + +/// A structure to hash data with a cryptographic hash algorithm. +/// +/// Instances of this class are safe for use with untrusted data, provided Git has been compiled +/// with a collision-detecting implementation of SHA-1. +pub struct CryptoHasher { + algo: HashAlgorithm, + ctx: *mut c_void, +} + +impl CryptoHasher { + /// Create a new hasher with the algorithm specified with `algo`. + /// + /// This hasher is safe to use on untrusted data. If SHA-1 is selected and Git was compiled + /// with a collision-detecting implementation of SHA-1, then this function will use that + /// implementation and detect any attempts at a collision. + pub fn new(algo: HashAlgorithm) -> Self { + let ctx = unsafe { c::git_hash_alloc() }; + unsafe { c::git_hash_init(ctx, algo.hash_algo_ptr()) }; + Self { algo, ctx } + } +} + +impl CryptoDigest for CryptoHasher { + /// Return true if this digest is safe for use with untrusted data, false otherwise. + fn is_safe(&self) -> bool { + true + } + + /// Update the hasher with the specified data. + fn update(&mut self, data: &[u8]) { + unsafe { c::git_hash_update(self.ctx, data.as_ptr() as *const c_void, data.len()) }; + } + + /// Return an object ID, consuming the hasher. 
+ fn into_oid(self) -> ObjectID { + let mut oid = ObjectID { + hash: [0u8; 32], + algo: self.algo as u32, + }; + unsafe { c::git_hash_final_oid(&mut oid as *mut ObjectID as *mut c_void, self.ctx) }; + oid + } + + /// Return a hash as a `Vec`, consuming the hasher. + fn into_vec(self) -> Vec<u8> { + let mut v = vec![0u8; self.algo.raw_len()]; + unsafe { c::git_hash_final(v.as_mut_ptr(), self.ctx) }; + v + } +} + +impl Clone for CryptoHasher { + fn clone(&self) -> Self { + let ctx = unsafe { c::git_hash_alloc() }; + unsafe { c::git_hash_clone(ctx, self.ctx) }; + Self { + algo: self.algo, + ctx, + } + } +} + +impl Drop for CryptoHasher { + fn drop(&mut self) { + unsafe { c::git_hash_free(self.ctx) }; + } +} + +impl Write for CryptoHasher { + fn write(&mut self, data: &[u8]) -> io::Result<usize> { + self.update(data); + Ok(data.len()) + } + + fn flush(&mut self) -> io::Result<()> { + Ok(()) + } +} + /// A hash algorithm, #[repr(C)] #[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq)] @@ -239,6 +334,11 @@ impl HashAlgorithm { pub fn hash_algo_ptr(self) -> *const c_void { unsafe { c::hash_algo_ptr_by_number(self as u32) } } + + /// Create a hasher for this algorithm. + pub fn hasher(self) -> CryptoHasher { + CryptoHasher::new(self) + } } pub mod c { @@ -246,12 +346,21 @@ pub mod c { extern "C" { pub fn hash_algo_ptr_by_number(n: u32) -> *const c_void; + pub fn unsafe_hash_algo(algop: *const c_void) -> *const c_void; + pub fn git_hash_alloc() -> *mut c_void; + pub fn git_hash_free(ctx: *mut c_void); + pub fn git_hash_init(dst: *mut c_void, algop: *const c_void); + pub fn git_hash_clone(dst: *mut c_void, src: *const c_void); + pub fn git_hash_update(ctx: *mut c_void, inp: *const c_void, len: usize); + pub fn git_hash_final(hash: *mut u8, ctx: *mut c_void); + pub fn git_hash_final_oid(hash: *mut c_void, ctx: *mut c_void); } } #[cfg(test)] mod tests { - use super::HashAlgorithm; + use super::{CryptoDigest, HashAlgorithm, ObjectID}; + use std::io::Write; fn all_algos() -> &'static [HashAlgorithm] { &[HashAlgorithm::SHA1, HashAlgorithm::SHA256] @@ -322,4 +431,36 @@ mod tests { assert_eq!(format!("{:?}", oid), *debug); } } + + #[test] + fn hasher_works_correctly() { + for algo in all_algos() { + let tests: &[(&[u8], &ObjectID)] = &[ + (b"blob 0\0", algo.empty_blob()), + (b"tree 0\0", algo.empty_tree()), + ]; + for (data, oid) in tests { + let mut h = algo.hasher(); + assert!(h.is_safe()); + // Test that this works incrementally. + h.update(&data[0..2]); + h.update(&data[2..]); + + let h2 = h.clone(); + + let actual_oid = h.into_oid(); + assert_eq!(**oid, actual_oid); + + let v = h2.into_vec(); + assert_eq!((*oid).as_slice().unwrap(), &v); + + let mut h = algo.hasher(); + h.write_all(&data[0..2]).unwrap(); + h.write_all(&data[2..]).unwrap(); + + let actual_oid = h.into_oid(); + assert_eq!(**oid, actual_oid); + } + } + } } ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v2 13/15] rust: add a new binary object map format 2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson ` (11 preceding siblings ...) 2025-11-17 22:16 ` [PATCH v2 12/15] rust: add functionality to hash an object brian m. carlson @ 2025-11-17 22:16 ` brian m. carlson 2025-11-17 22:16 ` [PATCH v2 14/15] rust: add a small wrapper around the hashfile code brian m. carlson 2025-11-17 22:16 ` [PATCH v2 15/15] object-file-convert: always make sure object ID algo is valid brian m. carlson 14 siblings, 0 replies; 101+ messages in thread From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren Our current loose object format has a few problems. First, it is not efficient: the list of object IDs is not sorted and even if it were, there would not be an efficient way to look up objects in both algorithms. Second, we need to store mappings for things which are not technically loose objects but are not packed objects, either, and so cannot be stored in a pack index. These kinds of things include shallows, their parents, and their trees, as well as submodules. Yet we also need to implement a sensible way to store the kind of object so that we can prune unneeded entries. For instance, if the user has updated the shallows, we can remove the old values. For these reasons, introduce a new binary object map format. The careful reader will notice that it resembles very closely the pack index v3 format. Add an in-memory object map as well, and allow writing to a batched map, which can then be written later as one of the binary object maps. Include several tests for round tripping and data lookup across algorithms. Note that the use of this code elsewhere in Git will involve some C code and some C-compatible code in Rust that will be introduced in a future commit. Thus, for example, we ignore the fact that if there is no current batch and the caller asks for data to be written, this code does nothing, mostly because this code also does not involve itself with opening or manipulating files. The C code that we will add later will implement this functionality at a higher level and take care of this, since the code which is necessary for writing to the object store is deeply involved with our C abstractions and it would require extensive work (which would not be especially valuable at this point) to port those to Rust. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Documentation/gitformat-loose.adoc | 78 +++ Makefile | 1 + src/lib.rs | 1 + src/loose.rs | 913 +++++++++++++++++++++++++++++ src/meson.build | 1 + 5 files changed, 994 insertions(+) create mode 100644 src/loose.rs diff --git a/Documentation/gitformat-loose.adoc b/Documentation/gitformat-loose.adoc index 947993663e..b0b569761b 100644 --- a/Documentation/gitformat-loose.adoc +++ b/Documentation/gitformat-loose.adoc @@ -10,6 +10,7 @@ SYNOPSIS -------- [verse] $GIT_DIR/objects/[0-9a-f][0-9a-f]/* +$GIT_DIR/objects/object-map/map-*.map DESCRIPTION ----------- @@ -48,6 +49,83 @@ stored under Similarly, a blob containing the contents `abc` would have the uncompressed data of `blob 3\0abc`. +== Loose object mapping + +When the `compatObjectFormat` option is used, Git needs to store a mapping +between the repository's main algorithm and the compatibility algorithm for +loose objects as well as some auxiliary information. + +The mapping consists of a set of files under `$GIT_DIR/objects/object-map` +ending in `.map`. 
The portion of the filename before the extension is that of +the main hash checksum (that is, the one specified in +`extensions.objectformat`) in hex format. + +`git gc` will repack existing entries into one file, removing any unnecessary +objects, such as obsolete shallow entries or loose objects that have been +packed. + +The file format is as follows. All values are in network byte order and all +4-byte and 8-byte values must be 4-byte aligned in the file, so the NUL padding +may be required in some cases. Git always uses the smallest number of NUL +bytes (including zero) that is required for the padding in order to make +writing files deterministic. + +- A header appears at the beginning and consists of the following: + * A 4-byte mapping signature: `LMAP` + * 4-byte version number: 1 + * 4-byte length of the header section (including reserved entries but + excluding any NUL padding). + * 4-byte number of objects declared in this map file. + * 4-byte number of object formats declared in this map file. + * For each object format: + ** 4-byte format identifier (e.g., `sha1` for SHA-1) + ** 4-byte length in bytes of shortened object names (that is, prefixes of + the full object names). This is the shortest possible length needed to + make names in the shortened object name table unambiguous. + ** 8-byte integer, recording where tables relating to this format + are stored in this index file, as an offset from the beginning. + * 8-byte offset to the trailer from the beginning of this file. + * The remainder of the header section is reserved for future use. + Readers must ignore unrecognized data here. +- Zero or more NUL bytes. These are used to improve the alignment of the + 4-byte quantities below. +- Tables for the first object format: + * A sorted table of shortened object names. These are prefixes of the names + of all objects in this file, packed together to reduce the cache footprint + of the binary search for a specific object name. + * A sorted table of full object names. + * A table of 4-byte metadata values. +- Zero or more NUL bytes. +- Tables for subsequent object formats: + * A sorted table of shortened object names. These are prefixes of the names + of all objects in this file, packed together without offset values to + reduce the cache footprint of the binary search for a specific object name. + * A table of full object names in the order specified by the first object format. + * A table of 4-byte values mapping object name order to the order of the + first object format. For an object in the table of sorted shortened object + names, the value at the corresponding index in this table is the index in + the previous table for that same object. + * Zero or more NUL bytes. +- The trailer consists of the following: + * Hash checksum of all of the above using the main hash. + +The lower six bits of each metadata table contain a type field indicating the +reason that this object is stored: + +0:: + Reserved. +1:: + This object is stored as a loose object in the repository. +2:: + This object is a shallow entry. The mapping refers to a shallow value + returned by a remote server. +3:: + This object is a submodule entry. The mapping refers to the commit stored + representing a submodule. + +Other data may be stored in this field in the future. Bits that are not used +must be zero. 
+ GIT --- Part of the linkgit:git[1] suite diff --git a/Makefile b/Makefile index 4211d7622a..40785c14fd 100644 --- a/Makefile +++ b/Makefile @@ -1536,6 +1536,7 @@ UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o RUST_SOURCES += src/hash.rs RUST_SOURCES += src/lib.rs +RUST_SOURCES += src/loose.rs RUST_SOURCES += src/varint.rs GIT-VERSION-FILE: FORCE diff --git a/src/lib.rs b/src/lib.rs index cf7c962509..442f9433dc 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1,2 +1,3 @@ pub mod hash; +pub mod loose; pub mod varint; diff --git a/src/loose.rs b/src/loose.rs new file mode 100644 index 0000000000..24accf9c33 --- /dev/null +++ b/src/loose.rs @@ -0,0 +1,913 @@ +// This program is free software; you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation: version 2 of the License, dated June 1991. +// +// This program is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License along +// with this program; if not, see <https://www.gnu.org/licenses/>. + +use crate::hash::{HashAlgorithm, ObjectID, GIT_MAX_RAWSZ}; +use std::collections::BTreeMap; +use std::convert::TryInto; +use std::io::{self, Write}; + +/// The type of object stored in the map. +/// +/// If this value is `Reserved`, then it is never written to disk and is used primarily to store +/// certain hard-coded objects, like the empty tree, empty blob, or null object ID. +/// +/// If this value is `LooseObject`, then this represents a loose object. `Shallow` represents a +/// shallow commit, its parent, or its tree. `Submodule` represents a submodule commit. +#[repr(C)] +#[derive(Debug, Clone, Copy, Ord, PartialOrd, Eq, PartialEq)] +pub enum MapType { + Reserved = 0, + LooseObject = 1, + Shallow = 2, + Submodule = 3, +} + +impl MapType { + pub fn from_u32(n: u32) -> Option<MapType> { + match n { + 0 => Some(Self::Reserved), + 1 => Some(Self::LooseObject), + 2 => Some(Self::Shallow), + 3 => Some(Self::Submodule), + _ => None, + } + } +} + +/// The value of an object stored in a `ObjectMemoryMap`. +/// +/// This keeps the object ID to which the key is mapped and its kind together. +struct MappedObject { + oid: ObjectID, + kind: MapType, +} + +/// Memory storage for a loose object. +struct ObjectMemoryMap { + to_compat: BTreeMap<ObjectID, MappedObject>, + to_storage: BTreeMap<ObjectID, MappedObject>, + compat: HashAlgorithm, + storage: HashAlgorithm, +} + +impl ObjectMemoryMap { + /// Create a new `ObjectMemoryMap`. + /// + /// The storage and compatibility `HashAlgorithm` instances are used to store the object IDs in + /// the correct map. + fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> Self { + Self { + to_compat: BTreeMap::new(), + to_storage: BTreeMap::new(), + compat, + storage, + } + } + + fn len(&self) -> usize { + self.to_compat.len() + } + + /// Write this map to an interface implementing `std::io::Write`. 
+ fn write<W: Write>(&self, wrtr: W) -> io::Result<()> { + const VERSION_NUMBER: u32 = 1; + const NUM_OBJECT_FORMATS: u32 = 2; + const PADDING: [u8; 4] = [0u8; 4]; + + let mut wrtr = wrtr; + let header_size: u32 = (4 * 5) + (4 + 4 + 8) * NUM_OBJECT_FORMATS + 8; + + wrtr.write_all(b"LMAP")?; + wrtr.write_all(&VERSION_NUMBER.to_be_bytes())?; + wrtr.write_all(&header_size.to_be_bytes())?; + wrtr.write_all(&(self.to_compat.len() as u32).to_be_bytes())?; + wrtr.write_all(&NUM_OBJECT_FORMATS.to_be_bytes())?; + + let storage_short_len = self.find_short_name_len(&self.to_compat, self.storage); + let compat_short_len = self.find_short_name_len(&self.to_storage, self.compat); + + let storage_npadding = Self::required_nul_padding(self.to_compat.len(), storage_short_len); + let compat_npadding = Self::required_nul_padding(self.to_compat.len(), compat_short_len); + + let mut offset: u64 = header_size as u64; + + for (algo, len, npadding) in &[ + (self.storage, storage_short_len, storage_npadding), + (self.compat, compat_short_len, compat_npadding), + ] { + wrtr.write_all(&algo.format_id().to_be_bytes())?; + wrtr.write_all(&(*len as u32).to_be_bytes())?; + + offset += *npadding; + wrtr.write_all(&offset.to_be_bytes())?; + + offset += self.to_compat.len() as u64 * (*len as u64 + algo.raw_len() as u64 + 4); + } + + wrtr.write_all(&offset.to_be_bytes())?; + + let order_map: BTreeMap<&ObjectID, usize> = self + .to_compat + .keys() + .enumerate() + .map(|(i, oid)| (oid, i)) + .collect(); + + wrtr.write_all(&PADDING[0..storage_npadding as usize])?; + for oid in self.to_compat.keys() { + wrtr.write_all(&oid.as_slice().unwrap()[0..storage_short_len])?; + } + for oid in self.to_compat.keys() { + wrtr.write_all(oid.as_slice().unwrap())?; + } + for meta in self.to_compat.values() { + wrtr.write_all(&(meta.kind as u32).to_be_bytes())?; + } + + wrtr.write_all(&PADDING[0..compat_npadding as usize])?; + for oid in self.to_storage.keys() { + wrtr.write_all(&oid.as_slice().unwrap()[0..compat_short_len])?; + } + for meta in self.to_compat.values() { + wrtr.write_all(meta.oid.as_slice().unwrap())?; + } + for meta in self.to_storage.values() { + wrtr.write_all(&(order_map[&meta.oid] as u32).to_be_bytes())?; + } + + Ok(()) + } + + fn required_nul_padding(nitems: usize, short_len: usize) -> u64 { + let shortened_table_len = nitems as u64 * short_len as u64; + let misalignment = shortened_table_len & 3; + // If the value is 0, return 0; otherwise, return the difference from 4. 
+ (4 - misalignment) & 3 + } + + fn last_matching_offset(a: &ObjectID, b: &ObjectID, algop: HashAlgorithm) -> usize { + for i in 0..=algop.raw_len() { + if a.hash[i] != b.hash[i] { + return i; + } + } + algop.raw_len() + } + + fn find_short_name_len( + &self, + map: &BTreeMap<ObjectID, MappedObject>, + algop: HashAlgorithm, + ) -> usize { + if map.len() <= 1 { + return 1; + } + let mut len = 1; + let mut iter = map.keys(); + let mut cur = match iter.next() { + Some(cur) => cur, + None => return len, + }; + for item in iter { + let offset = Self::last_matching_offset(cur, item, algop); + if offset >= len { + len = offset + 1; + } + cur = item; + } + if len > algop.raw_len() { + algop.raw_len() + } else { + len + } + } +} + +struct ObjectFormatData { + data_off: usize, + shortened_len: usize, + full_off: usize, + mapping_off: Option<usize>, +} + +pub struct MmapedObjectMapIter<'a> { + offset: usize, + algos: Vec<HashAlgorithm>, + source: &'a MmapedObjectMap<'a>, +} + +impl<'a> Iterator for MmapedObjectMapIter<'a> { + type Item = Vec<ObjectID>; + + fn next(&mut self) -> Option<Self::Item> { + if self.offset >= self.source.nitems { + return None; + } + let offset = self.offset; + self.offset += 1; + let v: Vec<ObjectID> = self + .algos + .iter() + .cloned() + .filter_map(|algo| self.source.oid_from_offset(offset, algo)) + .collect(); + if v.len() != self.algos.len() { + return None; + } + Some(v) + } +} + +#[allow(dead_code)] +pub struct MmapedObjectMap<'a> { + memory: &'a [u8], + nitems: usize, + meta_off: usize, + obj_formats: BTreeMap<HashAlgorithm, ObjectFormatData>, + main_algo: HashAlgorithm, +} + +#[derive(Debug)] +#[allow(dead_code)] +enum MmapedParseError { + HeaderTooSmall, + InvalidSignature, + InvalidVersion, + UnknownAlgorithm, + OffsetTooLarge, + TooFewObjectFormats, + UnalignedData, + InvalidTrailerOffset, +} + +#[allow(dead_code)] +impl<'a> MmapedObjectMap<'a> { + fn new( + slice: &'a [u8], + hash_algo: HashAlgorithm, + ) -> Result<MmapedObjectMap<'a>, MmapedParseError> { + let object_format_header_size = 4 + 4 + 8; + let trailer_offset_size = 8; + let header_size: usize = + 4 + 4 + 4 + 4 + 4 + object_format_header_size * 2 + trailer_offset_size; + if slice.len() < header_size { + return Err(MmapedParseError::HeaderTooSmall); + } + if slice[0..4] != *b"LMAP" { + return Err(MmapedParseError::InvalidSignature); + } + if Self::u32_at_offset(slice, 4) != 1 { + return Err(MmapedParseError::InvalidVersion); + } + let _ = Self::u32_at_offset(slice, 8) as usize; + let nitems = Self::u32_at_offset(slice, 12) as usize; + let nobj_formats = Self::u32_at_offset(slice, 16) as usize; + if nobj_formats < 2 { + return Err(MmapedParseError::TooFewObjectFormats); + } + let mut offset = 20; + let mut meta_off = None; + let mut data = BTreeMap::new(); + for i in 0..nobj_formats { + if offset + object_format_header_size + trailer_offset_size > slice.len() { + return Err(MmapedParseError::HeaderTooSmall); + } + let format_id = Self::u32_at_offset(slice, offset); + let shortened_len = Self::u32_at_offset(slice, offset + 4) as usize; + let data_off = Self::u64_at_offset(slice, offset + 8); + + let algo = HashAlgorithm::from_format_id(format_id) + .ok_or(MmapedParseError::UnknownAlgorithm)?; + let data_off: usize = data_off + .try_into() + .map_err(|_| MmapedParseError::OffsetTooLarge)?; + + // Every object format must have these entries. 
+            let shortened_table_len = shortened_len
+                .checked_mul(nitems)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            let full_off = data_off
+                .checked_add(shortened_table_len)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            Self::verify_aligned(full_off)?;
+            Self::verify_valid(slice, full_off as u64)?;
+
+            let full_length = algo
+                .raw_len()
+                .checked_mul(nitems)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            let off = full_length
+                .checked_add(full_off)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            Self::verify_aligned(off)?;
+            Self::verify_valid(slice, off as u64)?;
+
+            // This is for the metadata for the first object format and for the order mapping for
+            // other object formats.
+            let meta_size = nitems
+                .checked_mul(4)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            let meta_end = off
+                .checked_add(meta_size)
+                .ok_or(MmapedParseError::OffsetTooLarge)?;
+            Self::verify_valid(slice, meta_end as u64)?;
+
+            let mut mapping_off = None;
+            if i == 0 {
+                meta_off = Some(off);
+            } else {
+                mapping_off = Some(off);
+            }
+
+            data.insert(
+                algo,
+                ObjectFormatData {
+                    data_off,
+                    shortened_len,
+                    full_off,
+                    mapping_off,
+                },
+            );
+            offset += object_format_header_size;
+        }
+        let trailer = Self::u64_at_offset(slice, offset);
+        Self::verify_aligned(trailer as usize)?;
+        Self::verify_valid(slice, trailer)?;
+        let end = trailer
+            .checked_add(hash_algo.raw_len() as u64)
+            .ok_or(MmapedParseError::OffsetTooLarge)?;
+        if end != slice.len() as u64 {
+            return Err(MmapedParseError::InvalidTrailerOffset);
+        }
+        match meta_off {
+            Some(meta_off) => Ok(MmapedObjectMap {
+                memory: slice,
+                nitems,
+                meta_off,
+                obj_formats: data,
+                main_algo: hash_algo,
+            }),
+            None => Err(MmapedParseError::TooFewObjectFormats),
+        }
+    }
+
+    fn iter(&self) -> MmapedObjectMapIter<'_> {
+        let mut algos = Vec::with_capacity(self.obj_formats.len());
+        algos.push(self.main_algo);
+        for algo in self.obj_formats.keys().cloned() {
+            if algo != self.main_algo {
+                algos.push(algo);
+            }
+        }
+        MmapedObjectMapIter {
+            offset: 0,
+            algos,
+            source: self,
+        }
+    }
+
+    /// Treats `sl` as if it were a set of slices of `wanted.len()` bytes, and searches for
+    /// `wanted` within it.
+    ///
+    /// If found, returns the offset of the subslice in `sl`.
+    ///
+    /// ```
+    /// let sl: &[u8] = &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
+    ///
+    /// assert_eq!(MmapedObjectMap::binary_search_slice(sl, &[2, 3]), Some(1));
+    /// assert_eq!(MmapedObjectMap::binary_search_slice(sl, &[6, 7]), Some(3));
+    /// assert_eq!(MmapedObjectMap::binary_search_slice(sl, &[1, 2]), None);
+    /// assert_eq!(MmapedObjectMap::binary_search_slice(sl, &[10, 20]), None);
+    /// ```
+    fn binary_search_slice(sl: &[u8], wanted: &[u8]) -> Option<usize> {
+        let len = wanted.len();
+        let res = sl.binary_search_by(|item| {
+            // We would like element_offset, but that is currently nightly only. Instead, do a
+            // pointer subtraction to find the index.
+            let index = unsafe { (item as *const u8).offset_from(sl.as_ptr()) } as usize;
+            // Now we have the index of this object. Round it down to the nearest full-sized
+            // chunk to find the actual offset where this starts.
+            let index = index - (index % len);
+            // Compute the comparison of that value instead, which will provide the expected
+            // result.
+            sl[index..index + wanted.len()].cmp(wanted)
+        });
+        res.ok().map(|offset| offset / len)
+    }
+
+    /// Look up `oid` in the map in order to convert it to `algo`.
+    ///
+    /// If this object is in the map, return the offset in the table for the main algorithm.
+    fn look_up_object(&self, oid: &ObjectID) -> Option<usize> {
+        let oid_algo = HashAlgorithm::from_u32(oid.algo)?;
+        let params = self.obj_formats.get(&oid_algo)?;
+        let short_table =
+            &self.memory[params.data_off..params.data_off + (params.shortened_len * self.nitems)];
+        let index = Self::binary_search_slice(
+            short_table,
+            &oid.as_slice().unwrap()[0..params.shortened_len],
+        )?;
+        match params.mapping_off {
+            Some(from_off) => {
+                // oid is in a compatibility algorithm. Find the mapping index.
+                let mapped = Self::u32_at_offset(self.memory, from_off + index * 4) as usize;
+                if mapped >= self.nitems {
+                    return None;
+                }
+                let oid_offset = params.full_off + mapped * oid_algo.raw_len();
+                if self.memory[oid_offset..oid_offset + oid_algo.raw_len()]
+                    != *oid.as_slice().unwrap()
+                {
+                    return None;
+                }
+                Some(mapped)
+            }
+            None => {
+                // oid is in the main algorithm. Find the object ID in the main map to confirm
+                // it's correct.
+                let oid_offset = params.full_off + index * oid_algo.raw_len();
+                if self.memory[oid_offset..oid_offset + oid_algo.raw_len()]
+                    != *oid.as_slice().unwrap()
+                {
+                    return None;
+                }
+                Some(index)
+            }
+        }
+    }
+
+    #[allow(dead_code)]
+    fn map_object(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<MappedObject> {
+        let main = self.look_up_object(oid)?;
+        let meta = MapType::from_u32(Self::u32_at_offset(self.memory, self.meta_off + (main * 4)))?;
+        Some(MappedObject {
+            oid: self.oid_from_offset(main, algo)?,
+            kind: meta,
+        })
+    }
+
+    fn map_oid(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<ObjectID> {
+        if algo as u32 == oid.algo {
+            return Some(oid.clone());
+        }
+
+        let main = self.look_up_object(oid)?;
+        self.oid_from_offset(main, algo)
+    }
+
+    fn oid_from_offset(&self, offset: usize, algo: HashAlgorithm) -> Option<ObjectID> {
+        let aparams = self.obj_formats.get(&algo)?;
+
+        let mut hash = [0u8; GIT_MAX_RAWSZ];
+        let len = algo.raw_len();
+        let oid_off = aparams.full_off + (offset * len);
+        hash[0..len].copy_from_slice(&self.memory[oid_off..oid_off + len]);
+        Some(ObjectID {
+            hash,
+            algo: algo as u32,
+        })
+    }
+
+    fn u32_at_offset(slice: &[u8], offset: usize) -> u32 {
+        u32::from_be_bytes(slice[offset..offset + 4].try_into().unwrap())
+    }
+
+    fn u64_at_offset(slice: &[u8], offset: usize) -> u64 {
+        u64::from_be_bytes(slice[offset..offset + 8].try_into().unwrap())
+    }
+
+    fn verify_aligned(offset: usize) -> Result<(), MmapedParseError> {
+        if (offset & 3) != 0 {
+            return Err(MmapedParseError::UnalignedData);
+        }
+        Ok(())
+    }
+
+    fn verify_valid(slice: &[u8], offset: u64) -> Result<(), MmapedParseError> {
+        if offset >= slice.len() as u64 {
+            return Err(MmapedParseError::OffsetTooLarge);
+        }
+        Ok(())
+    }
+}
+
+/// A map for loose and other non-packed object IDs that maps between a storage and compatibility
+/// mapping.
+///
+/// In addition to the in-memory option, there is an optional batched storage, which can be used to
+/// write objects to disk in an efficient way.
+pub struct ObjectMap {
+    mem: ObjectMemoryMap,
+    batch: Option<ObjectMemoryMap>,
+}
+
+impl ObjectMap {
+    /// Create a new `ObjectMap` with the given hash algorithms.
+    ///
+    /// This initializes the memory map to automatically map the empty tree, empty blob, and null
+    /// object ID.
+    pub fn new(storage: HashAlgorithm, compat: HashAlgorithm) -> Self {
+        let mut map = ObjectMemoryMap::new(storage, compat);
+        for (main, compat) in &[
+            (storage.empty_tree(), compat.empty_tree()),
+            (storage.empty_blob(), compat.empty_blob()),
+            (storage.null_oid(), compat.null_oid()),
+        ] {
+            map.to_storage.insert(
+                (*compat).clone(),
+                MappedObject {
+                    oid: (*main).clone(),
+                    kind: MapType::Reserved,
+                },
+            );
+            map.to_compat.insert(
+                (*main).clone(),
+                MappedObject {
+                    oid: (*compat).clone(),
+                    kind: MapType::Reserved,
+                },
+            );
+        }
+        Self {
+            mem: map,
+            batch: None,
+        }
+    }
+
+    pub fn hash_algo(&self) -> HashAlgorithm {
+        self.mem.storage
+    }
+
+    /// Start a batch for efficient writing.
+    ///
+    /// If there is already a batch started, this does nothing and the existing batch is retained.
+    pub fn start_batch(&mut self) {
+        if self.batch.is_none() {
+            self.batch = Some(ObjectMemoryMap::new(self.mem.storage, self.mem.compat));
+        }
+    }
+
+    pub fn batch_len(&self) -> Option<usize> {
+        self.batch.as_ref().map(|b| b.len())
+    }
+
+    /// If a batch exists, write it to the writer.
+    pub fn finish_batch<W: Write>(&mut self, w: W) -> io::Result<()> {
+        if let Some(txn) = self.batch.take() {
+            txn.write(w)?;
+        }
+        Ok(())
+    }
+
+    /// If a batch exists, discard it without writing it anywhere.
+    pub fn abort_batch(&mut self) {
+        self.batch = None;
+    }
+
+    /// Return whether there is a batch already started.
+    ///
+    /// If you just want a batch to exist and don't care whether one has already been started, you
+    /// may simply call `start_batch` unconditionally.
+    pub fn has_batch(&self) -> bool {
+        self.batch.is_some()
+    }
+
+    /// Insert an object into the map.
+    ///
+    /// If `write` is true and there is a batch started, write the object into the batch as well as
+    /// into the memory map.
+    pub fn insert(&mut self, oid1: &ObjectID, oid2: &ObjectID, kind: MapType, write: bool) {
+        let (compat_oid, storage_oid) =
+            if HashAlgorithm::from_u32(oid1.algo) == Some(self.mem.compat) {
+                (oid1, oid2)
+            } else {
+                (oid2, oid1)
+            };
+        Self::insert_into(&mut self.mem, storage_oid, compat_oid, kind);
+        if write {
+            if let Some(ref mut batch) = self.batch {
+                Self::insert_into(batch, storage_oid, compat_oid, kind);
+            }
+        }
+    }
+
+    fn insert_into(
+        map: &mut ObjectMemoryMap,
+        storage: &ObjectID,
+        compat: &ObjectID,
+        kind: MapType,
+    ) {
+        map.to_compat.insert(
+            storage.clone(),
+            MappedObject {
+                oid: compat.clone(),
+                kind,
+            },
+        );
+        map.to_storage.insert(
+            compat.clone(),
+            MappedObject {
+                oid: storage.clone(),
+                kind,
+            },
+        );
+    }
+
+    #[allow(dead_code)]
+    fn map_object(&self, oid: &ObjectID, algo: HashAlgorithm) -> Option<&MappedObject> {
+        let map = if algo == self.mem.storage {
+            &self.mem.to_storage
+        } else {
+            &self.mem.to_compat
+        };
+        map.get(oid)
+    }
+
+    #[allow(dead_code)]
+    fn map_oid<'a, 'b: 'a>(
+        &'b self,
+        oid: &'a ObjectID,
+        algo: HashAlgorithm,
+    ) -> Option<&'a ObjectID> {
+        if algo as u32 == oid.algo {
+            return Some(oid);
+        }
+        let entry = self.map_object(oid, algo);
+        entry.map(|obj| &obj.oid)
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::{MapType, MmapedObjectMap, ObjectMap, ObjectMemoryMap};
+    use crate::hash::{CryptoDigest, CryptoHasher, HashAlgorithm, ObjectID};
+    use std::convert::TryInto;
+    use std::io::{self, Cursor, Write};
+
+    struct TrailingWriter {
+        curs: Cursor<Vec<u8>>,
+        hasher: CryptoHasher,
+    }
+
+    impl TrailingWriter {
+        fn new() -> Self {
+            Self {
+                curs: Cursor::new(Vec::new()),
+                hasher: CryptoHasher::new(HashAlgorithm::SHA256),
+            }
+        }
+
+        fn finalize(mut self) -> Vec<u8> {
+            let _ = self.hasher.flush();
+            let mut v = self.curs.into_inner();
+            v.extend(self.hasher.into_vec());
+            v
+        }
+    }
+
+    impl Write for TrailingWriter {
+        fn write(&mut self, data: &[u8]) -> io::Result<usize> {
+            self.hasher.write_all(data)?;
+            self.curs.write_all(data)?;
+            Ok(data.len())
+        }
+
+        fn flush(&mut self) -> io::Result<()> {
+            self.hasher.flush()?;
+            self.curs.flush()?;
+            Ok(())
+        }
+    }
+
+    fn sha1_oid(b: &[u8]) -> ObjectID {
+        assert_eq!(b.len(), 20);
+        let mut data = [0u8; 32];
+        data[0..20].copy_from_slice(b);
+        ObjectID {
+            hash: data,
+            algo: HashAlgorithm::SHA1 as u32,
+        }
+    }
+
+    fn sha256_oid(b: &[u8]) -> ObjectID {
+        assert_eq!(b.len(), 32);
+        ObjectID {
+            hash: b.try_into().unwrap(),
+            algo: HashAlgorithm::SHA256 as u32,
+        }
+    }
+
+    #[allow(clippy::type_complexity)]
+    fn test_entries() -> &'static [(&'static str, &'static [u8], &'static [u8], MapType, bool)] {
+        // These are all example blobs containing the content in the first argument.
+        &[
+            ("abc", b"\xf2\xba\x8f\x84\xab\x5c\x1b\xce\x84\xa7\xb4\x41\xcb\x19\x59\xcf\xc7\x09\x3b\x7f", b"\xc1\xcf\x6e\x46\x50\x77\x93\x0e\x88\xdc\x51\x36\x64\x1d\x40\x2f\x72\xa2\x29\xdd\xd9\x96\xf6\x27\xd6\x0e\x96\x39\xea\xba\x35\xa6", MapType::LooseObject, false),
+            ("def", b"\x0c\x00\x38\x32\xe7\xbf\xa9\xca\x8b\x5c\x20\x35\xc9\xbd\x68\x4a\x5f\x26\x23\xbc", b"\x8a\x90\x17\x26\x48\x4d\xb0\xf2\x27\x9f\x30\x8d\x58\x96\xd9\x6b\xf6\x3a\xd6\xde\x95\x7c\xa3\x8a\xdc\x33\x61\x68\x03\x6e\xf6\x63", MapType::Shallow, true),
+            ("ghi", b"\x45\xa8\x2e\x29\x5c\x52\x47\x31\x14\xc5\x7c\x18\xf4\xf5\x23\x68\xdf\x2a\x3c\xfd", b"\x6e\x47\x4c\x74\xf5\xd7\x78\x14\xc7\xf7\xf0\x7c\x37\x80\x07\x90\x53\x42\xaf\x42\x81\xe6\x86\x8d\x33\x46\x45\x4b\xb8\x63\xab\xc3", MapType::Submodule, false),
+            ("jkl", b"\x45\x32\x8c\x36\xff\x2e\x9b\x9b\x4e\x59\x2c\x84\x7d\x3f\x9a\x7f\xd9\xb3\xe7\x16", b"\xc3\xee\xf7\x54\xa2\x1e\xc6\x9d\x43\x75\xbe\x6f\x18\x47\x89\xa8\x11\x6f\xd9\x66\xfc\x67\xdc\x31\xd2\x11\x15\x42\xc8\xd5\xa0\xaf", MapType::LooseObject, true),
+        ]
+    }
+
+    fn test_map(write_all: bool) -> Box<ObjectMap> {
+        let mut map = Box::new(ObjectMap::new(HashAlgorithm::SHA256, HashAlgorithm::SHA1));
+
+        map.start_batch();
+
+        for (_blob_content, sha1, sha256, kind, swap) in test_entries() {
+            let s256 = sha256_oid(sha256);
+            let s1 = sha1_oid(sha1);
+            let write = write_all || (*kind as u32 & 2) == 0;
+            if *swap {
+                // Insert the item into the batch arbitrarily based on the type. This tests that
+                // we can specify either order and we'll do the right thing.
+                map.insert(&s256, &s1, *kind, write);
+            } else {
+                map.insert(&s1, &s256, *kind, write);
+            }
+        }
+
+        map
+    }
+
+    #[test]
+    fn can_read_and_write_format() {
+        for full in &[true, false] {
+            let mut map = test_map(*full);
+            let mut wrtr = TrailingWriter::new();
+            map.finish_batch(&mut wrtr).unwrap();
+
+            assert!(!map.has_batch());
+
+            let data = wrtr.finalize();
+            MmapedObjectMap::new(&data, HashAlgorithm::SHA256).unwrap();
+        }
+    }
+
+    #[test]
+    fn looks_up_from_mmaped() {
+        let mut map = test_map(true);
+        let mut wrtr = TrailingWriter::new();
+        map.finish_batch(&mut wrtr).unwrap();
+
+        assert!(!map.has_batch());
+
+        let data = wrtr.finalize();
+        let entries = test_entries();
+        let map = MmapedObjectMap::new(&data, HashAlgorithm::SHA256).unwrap();
+
+        for (_, sha1, sha256, kind, _) in entries {
+            let s256 = sha256_oid(sha256);
+            let s1 = sha1_oid(sha1);
+
+            let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(res.oid, s1);
+            assert_eq!(res.kind, *kind);
+            let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(res, s1);
+
+            let res = map.map_object(&s256, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(res.oid, s256);
+            assert_eq!(res.kind, *kind);
+            let res = map.map_oid(&s256, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(res, s256);
+
+            let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(res.oid, s256);
+            assert_eq!(res.kind, *kind);
+            let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(res, s256);
+
+            let res = map.map_object(&s1, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(res.oid, s1);
+            assert_eq!(res.kind, *kind);
+            let res = map.map_oid(&s1, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(res, s1);
+        }
+
+        for octet in &[0x00u8, 0x6d, 0x6e, 0x8a, 0xff] {
+            let missing_oid = ObjectID {
+                hash: [*octet; 32],
+                algo: HashAlgorithm::SHA256 as u32,
+            };
+
+            assert!(map.map_object(&missing_oid, HashAlgorithm::SHA1).is_none());
+            assert!(map.map_oid(&missing_oid, HashAlgorithm::SHA1).is_none());
+
+            assert_eq!(
+                map.map_oid(&missing_oid, HashAlgorithm::SHA256).unwrap(),
+                missing_oid
+            );
+        }
+    }
+
+    #[test]
+    fn binary_searches_slices_correctly() {
+        let sl = &[
+            0, 1, 2, 15, 14, 13, 18, 10, 2, 20, 20, 20, 21, 21, 0, 21, 21, 1, 21, 21, 21, 21, 21,
+            22, 22, 23, 24,
+        ];
+
+        let expected: &[(&[u8], Option<usize>)] = &[
+            (&[0, 1, 2], Some(0)),
+            (&[15, 14, 13], Some(1)),
+            (&[18, 10, 2], Some(2)),
+            (&[20, 20, 20], Some(3)),
+            (&[21, 21, 0], Some(4)),
+            (&[21, 21, 1], Some(5)),
+            (&[21, 21, 21], Some(6)),
+            (&[21, 21, 22], Some(7)),
+            (&[22, 23, 24], Some(8)),
+            (&[2, 15, 14], None),
+            (&[0, 21, 21], None),
+            (&[21, 21, 23], None),
+            (&[22, 22, 23], None),
+            (&[0xff, 0xff, 0xff], None),
+            (&[0, 0, 0], None),
+        ];
+
+        for (wanted, value) in expected {
+            assert_eq!(MmapedObjectMap::binary_search_slice(sl, wanted), *value);
+        }
+    }
+
+    #[test]
+    fn looks_up_oid_correctly() {
+        let map = test_map(false);
+        let entries = test_entries();
+
+        let s256 = sha256_oid(entries[0].2);
+        let s1 = sha1_oid(entries[0].1);
+
+        let missing_oid = ObjectID {
+            hash: [0xffu8; 32],
+            algo: HashAlgorithm::SHA256 as u32,
+        };
+
+        let res = map.map_object(&s256, HashAlgorithm::SHA1).unwrap();
+        assert_eq!(res.oid, s1);
+        assert_eq!(res.kind, MapType::LooseObject);
+        let res = map.map_oid(&s256, HashAlgorithm::SHA1).unwrap();
+        assert_eq!(*res, s1);
+
+        let res = map.map_object(&s1, HashAlgorithm::SHA256).unwrap();
+        assert_eq!(res.oid, s256);
+        assert_eq!(res.kind, MapType::LooseObject);
+        let res = map.map_oid(&s1, HashAlgorithm::SHA256).unwrap();
+        assert_eq!(*res, s256);
+
+        assert!(map.map_object(&missing_oid, HashAlgorithm::SHA1).is_none());
+        assert!(map.map_oid(&missing_oid, HashAlgorithm::SHA1).is_none());
+
+        assert_eq!(
+            *map.map_oid(&missing_oid, HashAlgorithm::SHA256).unwrap(),
+            missing_oid
+        );
+    }
+
+    #[test]
+    fn looks_up_known_oids_correctly() {
+        let map = test_map(false);
+
+        let funcs: &[&dyn Fn(HashAlgorithm) -> &'static ObjectID] = &[
+            &|h: HashAlgorithm| h.empty_tree(),
+            &|h: HashAlgorithm| h.empty_blob(),
+            &|h: HashAlgorithm| h.null_oid(),
+        ];
+
+        for f in funcs {
+            let s256 = f(HashAlgorithm::SHA256);
+            let s1 = f(HashAlgorithm::SHA1);
+
+            let res = map.map_object(s256, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(res.oid, *s1);
+            assert_eq!(res.kind, MapType::Reserved);
+            let res = map.map_oid(s256, HashAlgorithm::SHA1).unwrap();
+            assert_eq!(*res, *s1);
+
+            let res = map.map_object(s1, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(res.oid, *s256);
+            assert_eq!(res.kind, MapType::Reserved);
+            let res = map.map_oid(s1, HashAlgorithm::SHA256).unwrap();
+            assert_eq!(*res, *s256);
+        }
+    }
+
+    #[test]
+    fn nul_padding() {
+        assert_eq!(ObjectMemoryMap::required_nul_padding(1, 1), 3);
+        assert_eq!(ObjectMemoryMap::required_nul_padding(2, 1), 2);
+        assert_eq!(ObjectMemoryMap::required_nul_padding(3, 1), 1);
+        assert_eq!(ObjectMemoryMap::required_nul_padding(2, 2), 0);
+
+        assert_eq!(ObjectMemoryMap::required_nul_padding(39, 3), 3);
+    }
+}
diff --git a/src/meson.build b/src/meson.build
index c77041a3fa..1eea068519 100644
--- a/src/meson.build
+++ b/src/meson.build
@@ -1,6 +1,7 @@
 libgit_rs_sources = [
   'hash.rs',
   'lib.rs',
+  'loose.rs',
   'varint.rs',
 ]

^ permalink raw reply related	[flat|nested] 101+ messages in thread
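
A minimal sketch (not part of the patch) of the write/parse round trip
that the tests above exercise. It assumes it is added inside
src/loose.rs, where the private MmapedObjectMap helpers are visible, and
the two object IDs here are arbitrary placeholder bytes, not real
hashes:

    #[test]
    fn roundtrip_sketch() {
        // Two arbitrary IDs to pair up; only their algo fields matter here.
        let mut h1 = [0u8; 32];
        h1[0..20].copy_from_slice(&[0x11; 20]);
        let s1 = ObjectID {
            hash: h1,
            algo: HashAlgorithm::SHA1 as u32,
        };
        let s256 = ObjectID {
            hash: [0x22; 32],
            algo: HashAlgorithm::SHA256 as u32,
        };

        let mut map = ObjectMap::new(HashAlgorithm::SHA256, HashAlgorithm::SHA1);
        map.start_batch();
        // insert() works out which ID is storage and which is compat from
        // the algo fields, so either argument order is fine.
        map.insert(&s1, &s256, MapType::LooseObject, true);

        // Serialize the batch, then append raw_len trailing bytes: the
        // parser validates only the trailer's length, not its contents, so
        // a zeroed placeholder is enough for this sketch. Real files carry
        // a checksum there, as TrailingWriter demonstrates above.
        let mut data = Vec::new();
        map.finish_batch(&mut data).unwrap();
        data.extend([0u8; 32]);

        let parsed = MmapedObjectMap::new(&data, HashAlgorithm::SHA256).unwrap();
        assert_eq!(parsed.map_oid(&s1, HashAlgorithm::SHA256).unwrap(), s256);
        assert_eq!(parsed.map_oid(&s256, HashAlgorithm::SHA1).unwrap(), s1);
    }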
* [PATCH v2 14/15] rust: add a small wrapper around the hashfile code
  2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson
                     ` (12 preceding siblings ...)
  2025-11-17 22:16 ` [PATCH v2 13/15] rust: add a new binary object map format brian m. carlson
@ 2025-11-17 22:16 ` brian m. carlson
  2025-11-17 22:16 ` [PATCH v2 15/15] object-file-convert: always make sure object ID algo is valid brian m. carlson
  14 siblings, 0 replies; 101+ messages in thread
From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

Our new binary object map code avoids needing to be intimately involved
with file handling by simply writing data to any object implementing
Write.  This makes it very easy to test, since tests can write to a
Cursor wrapping a Vec, and it decouples the format from knowledge about
how we handle files.

However, we will ultimately want to write our data to an actual file,
since that's the most practical way to persist it.  Implement a wrapper
around the hashfile code that implements the Write trait so that we can
write our object map into a file.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile         |  1 +
 src/csum_file.rs | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
 src/lib.rs       |  1 +
 src/meson.build  |  1 +
 4 files changed, 84 insertions(+)
 create mode 100644 src/csum_file.rs

diff --git a/Makefile b/Makefile
index 40785c14fd..b05709c5e9 100644
--- a/Makefile
+++ b/Makefile
@@ -1534,6 +1534,7 @@ CLAR_TEST_OBJS += $(UNIT_TEST_DIR)/unit-test.o
 
 UNIT_TEST_OBJS += $(UNIT_TEST_DIR)/test-lib.o
 
+RUST_SOURCES += src/csum_file.rs
 RUST_SOURCES += src/hash.rs
 RUST_SOURCES += src/lib.rs
 RUST_SOURCES += src/loose.rs
diff --git a/src/csum_file.rs b/src/csum_file.rs
new file mode 100644
index 0000000000..7f2c6c4fcb
--- /dev/null
+++ b/src/csum_file.rs
@@ -0,0 +1,81 @@
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation: version 2 of the License, dated June 1991.
+//
+// This program is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License along
+// with this program; if not, see <https://www.gnu.org/licenses/>.
+
+use crate::hash::{HashAlgorithm, GIT_MAX_RAWSZ};
+use std::ffi::CStr;
+use std::io::{self, Write};
+use std::os::raw::c_void;
+
+/// A writer that can write files identified by their hash or containing a trailing hash.
+pub struct HashFile {
+    ptr: *mut c_void,
+    algo: HashAlgorithm,
+}
+
+impl HashFile {
+    /// Create a new HashFile.
+    ///
+    /// The hash used will be `algo`, its name should be in `name`, and an open file descriptor
+    /// pointing to that file should be in `fd`.
+    pub fn new(algo: HashAlgorithm, fd: i32, name: &CStr) -> HashFile {
+        HashFile {
+            ptr: unsafe { c::hashfd(algo.hash_algo_ptr(), fd, name.as_ptr()) },
+            algo,
+        }
+    }
+
+    /// Finalize this HashFile instance.
+    ///
+    /// Returns the hash computed over the data.
+    pub fn finalize(self, component: u32, flags: u32) -> Vec<u8> {
+        let mut result = vec![0u8; GIT_MAX_RAWSZ];
+        unsafe { c::finalize_hashfile(self.ptr, result.as_mut_ptr(), component, flags) };
+        result.truncate(self.algo.raw_len());
+        result
+    }
+}
+
+impl Write for HashFile {
+    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
+        for chunk in data.chunks(u32::MAX as usize) {
+            unsafe {
+                c::hashwrite(
+                    self.ptr,
+                    chunk.as_ptr() as *const c_void,
+                    chunk.len() as u32,
+                )
+            };
+        }
+        Ok(data.len())
+    }
+
+    fn flush(&mut self) -> io::Result<()> {
+        unsafe { c::hashflush(self.ptr) };
+        Ok(())
+    }
+}
+
+pub mod c {
+    use std::os::raw::{c_char, c_int, c_void};
+
+    extern "C" {
+        pub fn hashfd(algop: *const c_void, fd: i32, name: *const c_char) -> *mut c_void;
+        pub fn hashwrite(f: *mut c_void, data: *const c_void, len: u32);
+        pub fn hashflush(f: *mut c_void);
+        pub fn finalize_hashfile(
+            f: *mut c_void,
+            data: *mut u8,
+            component: u32,
+            flags: u32,
+        ) -> c_int;
+    }
+}
diff --git a/src/lib.rs b/src/lib.rs
index 442f9433dc..0c598298b1 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -1,3 +1,4 @@
+pub mod csum_file;
 pub mod hash;
 pub mod loose;
 pub mod varint;
diff --git a/src/meson.build b/src/meson.build
index 1eea068519..45739957b4 100644
--- a/src/meson.build
+++ b/src/meson.build
@@ -1,4 +1,5 @@
 libgit_rs_sources = [
+  'csum_file.rs',
   'hash.rs',
   'lib.rs',
   'loose.rs',

^ permalink raw reply related	[flat|nested] 101+ messages in thread
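
A minimal usage sketch (not part of the patch) for this wrapper. It
assumes a Unix platform (for into_raw_fd) and crate-internal paths; the
file name is a placeholder, and the 0/0 component/flags arguments merely
stand in for the fsync component and CSUM_* flags that real callers
would pass through to the C finalize_hashfile():

    use crate::csum_file::HashFile;
    use crate::hash::HashAlgorithm;
    use std::ffi::CString;
    use std::fs::File;
    use std::io::{self, Write};
    use std::os::unix::io::IntoRawFd;

    fn write_with_trailing_hash(data: &[u8]) -> io::Result<Vec<u8>> {
        let path = "objects/example.tmp"; // placeholder path
        let name = CString::new(path).unwrap();
        // Hand ownership of the descriptor to the C hashfile machinery.
        let fd = File::create(path)?.into_raw_fd();
        let mut hf = HashFile::new(HashAlgorithm::SHA256, fd, &name);
        hf.write_all(data)?;
        // finalize() consumes the HashFile and returns the hash computed
        // over everything written; 0/0 are placeholder arguments here.
        Ok(hf.finalize(0, 0))
    }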
* [PATCH v2 15/15] object-file-convert: always make sure object ID algo is valid
  2025-11-17 22:16 ` [PATCH v2 00/15] " brian m. carlson
                     ` (13 preceding siblings ...)
  2025-11-17 22:16 ` [PATCH v2 14/15] rust: add a small wrapper around the hashfile code brian m. carlson
@ 2025-11-17 22:16 ` brian m. carlson
  14 siblings, 0 replies; 101+ messages in thread
From: brian m. carlson @ 2025-11-17 22:16 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Ezekiel Newren

In some cases, we zero-initialize our object IDs, which sets the algo
member to zero as well, which is not a valid algorithm number.  This is
a bad practice, but we typically paper over it in many cases by simply
substituting the repository's hash algorithm.

However, our new Rust loose object map code doesn't handle this
gracefully and can't find object IDs when the algorithm is zero because
they don't compare equal to those with the correct algo field.  In
addition, the comparison code doesn't have any knowledge of what the
main algorithm is because that's global state, so we can't adjust the
comparison.

To make our code function properly and to avoid propagating these bad
entries, if we get a source object ID with a zero algo, just make a copy
of it with the fixed algorithm.  This has the benefit of also fixing the
object IDs if we're in a single algorithm mode as well.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 object-file-convert.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/object-file-convert.c b/object-file-convert.c
index e44c821084..f8dce94811 100644
--- a/object-file-convert.c
+++ b/object-file-convert.c
@@ -13,7 +13,7 @@
 #include "gpg-interface.h"
 #include "object-file-convert.h"
 
-int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
+int repo_oid_to_algop(struct repository *repo, const struct object_id *srcoid,
 		      const struct git_hash_algo *to, struct object_id *dest)
 {
 	/*
@@ -21,7 +21,15 @@ int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
 	 * default hash algorithm for that object.
 	 */
 	const struct git_hash_algo *from =
-		src->algo ? &hash_algos[src->algo] : repo->hash_algo;
+		srcoid->algo ? &hash_algos[srcoid->algo] : repo->hash_algo;
+	struct object_id temp;
+	const struct object_id *src = srcoid;
+
+	if (!srcoid->algo) {
+		oidcpy(&temp, srcoid);
+		temp.algo = hash_algo_by_ptr(repo->hash_algo);
+		src = &temp;
+	}
 
 	if (from == to || !to) {
 		if (src != dest)

^ permalink raw reply related	[flat|nested] 101+ messages in thread
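
A sketch (not part of the patch) of the failure mode being fixed, seen
from the Rust side. It assumes ObjectID's derived equality and ordering
include the algo field, which is what makes a zero-algo copy of the same
hash bytes a different key in the object map's BTreeMaps; the bytes
themselves are arbitrary placeholders:

    use crate::hash::{HashAlgorithm, ObjectID};

    #[test]
    fn zero_algo_is_a_different_key() {
        let good = ObjectID {
            hash: [0x11; 32],
            algo: HashAlgorithm::SHA1 as u32,
        };
        // A zero-initialized struct object_id arriving from C has algo == 0,
        // so it no longer compares equal to the key that was stored.
        let zeroed = ObjectID {
            algo: 0,
            ..good.clone()
        };
        assert_ne!(good, zeroed);
    }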